Apache Spark
Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Nov 2014
JAVACONF 2014
Fernando Rodriguez Olivera
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group, Professor at nosqlessentials.com
Apache Spark
Apache Spark is a fast and general engine
for large-scale data processing
Support for batch, interactive and stream
processing with a unified API
In-memory computing primitives
Hadoop MR Limits
Job → Job → Job
Hadoop HDFS
- Communication between jobs through the FS
- Fault tolerance (between jobs) by persistence to the FS
- Memory not managed (relies on OS caches)
MapReduce was designed for batch processing;
other workloads are compensated with: Storm, Samza, Giraph, Impala, Presto, etc.
Daytona Gray Sort 100TB Benchmark
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 (physical)
Apache Spark (2014)   100 TB      23 min   206     6,592 (virtualized)

3X faster using 10X fewer machines
Hadoop vs Spark for Iterative Processing
source: https://spark.apache.org/
Logistic regression in Hadoop and Spark
Apache Spark

Spark SQL | Spark Streaming | MLlib | GraphX
Apache Spark (Core)

Powered by Scala and Akka
Resilient Distributed Datasets (RDD)

Hello World
A New Line
hello
The End
...

RDD of Strings:
- Immutable Collection of Objects
- Partitioned and Distributed
- Stored in Memory
- Partitions Recomputed on Failure
RDD Transformations and Actions

RDD of Strings        RDD of Ints
Hello World           11
A New Line            10
hello                 5
The End               7
...                   ...

A transformation applies a compute function to every element (e.g. a function
that counts the chars of each line), producing a new RDD that depends on its
parent. An action (e.g. reducing the RDD of Ints to a single Int, N) triggers
the actual computation.

RDD Implementation:
- Partitions
- Compute Function (transformation)
- Dependencies
- Partitioner
- Preferred Compute Location (for each partition)
Spark API

Scala:

val spark = new SparkContext()

val lines = spark.textFile("hdfs://docs/")      // RDD[String]
val nonEmpty = lines.filter(l => l.nonEmpty)    // RDD[String]
val count = nonEmpty.count

Java 8:

JavaSparkContext spark = new JavaSparkContext();

JavaRDD<String> lines = spark.textFile("hdfs://docs/");
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
long count = nonEmpty.count();

Python:

spark = SparkContext()

lines = spark.textFile("hdfs://docs/")
nonEmpty = lines.filter(lambda line: len(line) > 0)
count = nonEmpty.count()
RDD Operations

Transformations: map(func), flatMap(func), filter(func), mapValues(func), groupByKey(), reduceByKey(func), …
Actions: count(), collect(), take(N), takeOrdered(N), top(N), reduce(func), …
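The RDD API deliberately mirrors Scala's collection API, so its semantics can be sketched on a plain local List (this is an analogue, not Spark itself; on an RDD the transformations are lazy and distributed):

```scala
// Local-collections sketch of the RDD operations above (plain Scala, not Spark).
val lines = List("Hello World", "A New Line", "hello", "The End")

// Transformations: describe a new collection from an existing one
val lower    = lines.map(_.toLowerCase)        // map(func)
val words    = lower.flatMap(_.split("\\s+"))  // flatMap(func)
val nonEmpty = words.filter(_.nonEmpty)        // filter(func)

// Actions: produce a concrete value (on an RDD this triggers the job)
assert(nonEmpty.size == 8)                     // count()
assert(nonEmpty.take(2) == List("hello", "world"))  // take(N)
```

The key difference: on a local List each step materializes immediately, while Spark only records the transformations and runs them when an action is called.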
Text Processing Example
Top Words by Frequency
(Step by step)

Create RDD from External Data

// Step 1 - Create RDD from Hadoop Text File
val docs = spark.textFile("/docs/")

Spark can read/write from any data source supported by Hadoop
(Hadoop FileSystem, I/O Formats, Codecs): HDFS, S3, HBase, Cassandra, MongoDB, ElasticSearch, …
I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop)
Function map

.map(line => line.toLowerCase) = .map(_.toLowerCase)

RDD[String]       RDD[String]
Hello World   →   hello world
A New Line        a new line
hello             hello
The end           the end
...               ...

// Step 2 - Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap

.map(_.split("\\s+")) turns each line into an array of words:

RDD[String]       RDD[Array[String]]
hello world   →   [hello, world]
a new line        [a, new, line]
hello             [hello]
the end           [the, end]
...               ...

A flatten step would then concatenate the arrays into a single RDD of words:

RDD[Array[String]]  →  RDD[String]: hello, world, a, new, line, hello, the, end, ...

.flatMap(line => line.split("\\s+")) does both in one step.
Note: flatten() is not available in Spark, only flatMap.

// Step 3 - Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
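The map-then-flatten equivalence above can be checked on plain Scala collections (a local analogue, not Spark):

```scala
// Local sketch (plain Scala Lists, not Spark): flatMap == map + flatten.
val lines = List("hello world", "a new line", "hello", "the end")

val arrays = lines.map(_.split("\\s+"))      // List[Array[String]], like RDD[Array[String]]
val words  = lines.flatMap(_.split("\\s+"))  // List[String], like RDD[String]

// flattening the arrays yields exactly what flatMap produced in one step
assert(arrays.flatten == words)
```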
Key-Value Pairs

.map(word => (word, 1)) = .map(word => Tuple2(word, 1))

RDD[String]       RDD[(String, Int)]
hello         →   (hello, 1)
a                 (a, 1)
world             (world, 1)
new               (new, 1)
line              (line, 1)
hello             (hello, 1)
...               ...

// Step 4 - Map each word to a (word, 1) pair
val counts = words.map(word => (word, 1))

RDD[(String, Int)] = RDD[Tuple2[String, Int]] (a "Pair RDD")
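The same pairing step on a local collection (plain Scala, not Spark), showing that the tuple literal is just sugar for Tuple2:

```scala
// Local sketch: building (word, 1) pairs on a plain Scala List.
val words = List("hello", "a", "world", "new", "line", "hello")

// (word, 1) is syntactic sugar for Tuple2(word, 1)
val counts = words.map(word => (word, 1))

assert(counts == words.map(word => Tuple2(word, 1)))
assert(counts.head == ("hello", 1))
```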
Shuffling

.groupByKey shuffles the pairs so that all values for a key land together:

RDD[(String, Int)]       RDD[(String, Iterable[Int])]
(hello, 1)               (hello, [1, 1])
(a, 1)               →   (a, [1])
(world, 1)               (world, [1])
(new, 1)                 (new, [1])
(line, 1)                (line, [1])
(hello, 1)

.mapValues(_.reduce((a, b) => a + b)) then sums each group:

RDD[(String, Iterable[Int])]       RDD[(String, Int)]
(hello, [1, 1])                →   (hello, 2)
(a, [1])                           (a, 1)
...                                ...

.reduceByKey((a, b) => a + b) does both, combining values on the map side
before the shuffle:

// Step 5 - Count all words
val freq = counts.reduceByKey(_ + _)
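The equivalence between groupByKey + mapValues and reduceByKey can be illustrated on local collections (plain Scala analogue, not Spark; a local fold stands in for reduceByKey's per-key combination):

```scala
// Local analogue of groupByKey + reduce per group vs reduceByKey.
val counts = List(("hello", 1), ("a", 1), ("world", 1),
                  ("new", 1), ("line", 1), ("hello", 1))

// groupByKey analogue, then reduce each group's values
val grouped = counts.groupBy { case (word, _) => word }   // like RDD[(String, Iterable[Int])]
val freqA   = grouped.map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }

// reduceByKey analogue: combine values per key while folding, no full groups kept
val freqB = counts.foldLeft(Map.empty[String, Int]) {
  case (acc, (w, n)) => acc + (w -> (acc.getOrElse(w, 0) + n))
}

assert(freqA == freqB)
assert(freqA("hello") == 2)
```

The second form never materializes the per-key groups, which is why reduceByKey shuffles far less data than groupByKey.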
Top N (Prepare data)

// Step 6 - Swap tuples (partial code)
freq.map(_.swap)

RDD[(String, Int)]       RDD[(Int, String)]
(world, 1)           →   (1, world)
(a, 1)                   (1, a)
(new, 1)                 (1, new)
(line, 1)                (1, line)
(hello, 2)               (2, hello)
Top N (First Attempt)

RDD[(Int, String)]   .sortByKey   RDD[(Int, String)]   .take(N)   Array[(Int, String)]
(1, world)       →                (2, hello)       →              (2, hello)
(1, a)                            (1, world)                      (1, world)
(1, new)                          (1, a)
(1, line)                         (1, new)
(2, hello)                        (1, line)

(sortByKey(false) for descending)
Top N

.top(N) computes a local top N on each partition * and then reduces the
partial results into the final Array[(Int, String)], avoiding a full sort:

RDD[(Int, String)]  →  .top(N)  →  Array[(Int, String)]: (2, hello), (1, line)

// Step 6 - Swap tuples and take the top results (complete code)
val top = freq.map(_.swap).top(N)

* local top N implemented with bounded priority queues
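A local sketch of what top(N) computes (plain Scala, not Spark; Spark's top uses the natural ordering of (Int, String) tuples, here approximated with a descending sort and take):

```scala
// Local sketch of top-N after swapping (count, word) pairs.
val freq = List(("world", 1), ("a", 1), ("new", 1), ("line", 1), ("hello", 2))

val top = freq.map(_.swap)                      // (word, n) -> (n, word)
              .sortBy { case (n, _) => -n }     // descending by count
              .take(2)

assert(top.head == (2, "hello"))
```

On a real RDD this sort-everything approach is exactly what top(N) avoids: each partition keeps only N candidates in a bounded priority queue.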
Top Words by Frequency (Full Code)

val spark = new SparkContext()

// RDD creation from external data source
val docs = spark.textFile("hdfs://docs/")

// Split lines into words
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))

// Count all words (automatic combination)
val freq = counts.reduceByKey(_ + _)

// Swap tuples and get top results
val top = freq.map(_.swap).top(N)

top.foreach(println)
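The whole pipeline can be traced end-to-end on plain Scala collections (a local analogue, not Spark: Lists stand in for RDDs, and reduceByKey becomes groupBy + a per-group sum):

```scala
// End-to-end local sketch of the word-count pipeline above.
val docs = List("Hello World", "A New Line", "hello", "The End")

val top = docs
  .map(_.toLowerCase)                                   // lower
  .flatMap(_.split("\\s+"))                             // words
  .map(word => (word, 1))                               // counts
  .groupBy(_._1).map { case (w, ps) => (w, ps.size) }   // freq (reduceByKey analogue)
  .toList.map(_.swap)                                   // (n, word)
  .sortBy { case (n, _) => -n }.take(3)                 // top (top(N) analogue)

assert(top.head == (2, "hello"))
```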
RDD Persistence (in-memory)

.cache()                  (memory only)
.persist()                (memory only)
.persist(storageLevel)    (lazy persistence & caching)

StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, …
SchemaRDD

RDD of Row + Column Metadata
Queries with SQL
Support for Reflection, JSON, Parquet, …

case class Word(text: String, n: Int)

val wordsFreq = freq.map {
  case (text, count) => Word(text, count)
} // RDD[Word]

wordsFreq.registerTempTable("wordsFreq")

val topWords = sql("""select text, n
                      from wordsFreq
                      order by n desc
                      limit 20""") // RDD[Row]

topWords.collect().foreach(println)
RDD Lineage

words = sc.textFile("hdfs://large/file/")   // HadoopRDD
            .map(_.toLowerCase)             // MappedRDD
            .flatMap(_.split(" "))          // FlatMappedRDD

alpha = words.filter(_.matches("[a-z]+"))   // FilteredRDD
nums  = words.filter(_.matches("[0-9]+"))   // FilteredRDD

alpha.count()                               // Action (run job on the cluster)

The lineage is built on the driver by the transformations; an action runs the job on the cluster.
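The lazy, lineage-building behavior has a close analogue in Scala's collection views (plain Scala, not Spark): a view's map only records the operation, and forcing the view plays the role of an action:

```scala
// Laziness analogue: like RDD transformations, a view records work;
// forcing it (the "action") actually runs it.
var evaluated = 0

val v = List(1, 2, 3).view.map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)          // transformation recorded, nothing computed yet

val result = v.toList           // like an action: forces the computation
assert(evaluated == 3)
assert(result == List(2, 4, 6))
```

Unlike a view, an RDD's recorded lineage also lets Spark recompute lost partitions after a failure.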
Deployment with Hadoop

[Diagram: the file /large/file is stored in HDFS as blocks A, B, C and D,
replicated (RF 3) across Data Nodes 1-4, with the Name Node tracking block
locations. A Spark Worker runs on each Data Node (DN + Spark), co-located
with the HDFS data and coordinated by a Spark Master. The client submits the
application (mode=cluster); the master allocates resources (cores and memory),
and the driver runs with executors on the workers.]
Fernando Rodriguez Olivera
twitter: @frodriguez
  • 3. Apache Spark Apache Spark is a Fast and General Engine for Large-Scale data processing. Support for Batch, Interactive and Stream processing with a Unified API. In-Memory computing primitives.
  • 4. Hadoop MR Limits: MapReduce was designed for Batch Processing, with jobs reading from and writing to Hadoop HDFS. Communication between jobs goes through the FS; Fault-Tolerance (between jobs) relies on Persistence to the FS; Memory is not managed (relies on OS caches). Compensated with: Storm, Samza, Giraph, Impala, Presto, etc.
  • 5. Daytona Gray Sort 100TB Benchmark
    source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
                          Data Size   Time     Nodes   Cores
    Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 physical
    Apache Spark (2014)   100 TB      23 min     206    6,592 virtualized
  • 6. (same benchmark) 3X faster using 10X fewer machines
  • 7. Hadoop vs Spark for Iterative Proc source: https://spark.apache.org/ Logistic regression in Hadoop and Spark
  • 8. Apache Spark Apache Spark (Core) Spark SQL Spark Streaming ML lib GraphX Powered by Scala and Akka
  • 9-12. Resilient Distributed Datasets (RDD)
    [diagram: an RDD of Strings holding "Hello World", …, "A New Line", "hello", "The End", …]
    Immutable Collection of Objects
    Partitioned and Distributed
    Stored in Memory
    Partitions Recomputed on Failure
  • 13-18. RDD Transformations and Actions
    [diagram: a Compute Function (transformation), e.g. a function that counts
    chars, turns the RDD of Strings into an RDD of Ints (11, 10, 5, 7, …);
    the new RDD depends on its parent; an action then reduces it to an Int N]
    RDD Implementation: Partitions, Compute Function, Dependencies,
    Preferred Compute Location (for each partition), Partitioner
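The RDD internals listed on slide 18 (partitions, a compute function, dependencies) can be sketched in a few lines of plain Python. The toy class below is only an illustration of that contract under stated assumptions, not Spark's implementation; all names in it are made up.

```python
class MiniRDD:
    """Toy, single-process illustration of the RDD contract:
    a list of partitions + a compute function + a dependency on a
    parent RDD (the lineage). Not Spark code; names are illustrative."""

    def __init__(self, partitions, compute, parent=None):
        self.partitions = partitions   # list of partition ids
        self.compute = compute         # fn(partition_id) -> iterator of records
        self.parent = parent           # lineage: the parent this RDD depends on

    def map(self, f):
        # Transformation: returns a new RDD whose compute function applies f
        # to each record of the parent. Nothing is computed yet (lazy).
        return MiniRDD(self.partitions,
                       lambda p: (f(x) for x in self.compute(p)),
                       parent=self)

    def collect(self):
        # Action: actually runs the compute function over every partition.
        return [x for p in self.partitions for x in self.compute(p)]


# Two partitions of raw lines; "recomputing a lost partition on failure"
# just means calling compute(p) again against the source data.
data = {0: ["Hello World", "A New Line"], 1: ["hello", "The End"]}
lines = MiniRDD([0, 1], lambda p: iter(data[p]))
lengths = lines.map(len)     # transformation: count chars, nothing runs yet
print(lengths.collect())     # action triggers computation: [11, 10, 5, 7]
```

Because `map` only wraps the parent's compute function, a lost partition can be rebuilt by re-running `compute(p)`, which is the lineage-based fault tolerance the earlier slides describe.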
  • 19. Spark API
    Scala:
      val spark = new SparkContext()
      val lines = spark.textFile("hdfs://docs/")      // RDD[String]
      val nonEmpty = lines.filter(l => l.nonEmpty)    // RDD[String]
      val count = nonEmpty.count
    Java 8:
      SparkContext spark = new SparkContext();
      JavaRDD<String> lines = spark.textFile("hdfs://docs/");
      JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
      long count = nonEmpty.count();
    Python:
      spark = SparkContext()
      lines = spark.textFile("hdfs://docs/")
      nonEmpty = lines.filter(lambda line: len(line) > 0)
      count = nonEmpty.count()
  • 21. Text Processing Example Top Words by Frequency (Step by step)
  • 22. Create RDD from External Data
      // Step 1 - Create RDD from Hadoop Text File
      val docs = spark.textFile("/docs/")
    Spark can read/write from any data source supported by Hadoop (HDFS, S3,
    HBase, Cassandra, MongoDB, ElasticSearch, …) through the Hadoop FileSystem,
    I/O Formats and Codecs. I/O via Hadoop is optional (e.g.: the Cassandra
    connector bypasses Hadoop).
  • 23. Function map
      // Step 2 - Convert lines to lower case
      val lower = docs.map(line => line.toLowerCase)
    The RDD[String] ("Hello World", "A New Line", "hello", …, "The end") becomes
    a new RDD[String] ("hello world", "a new line", "hello", …, "the end").
    Note: .map(line => line.toLowerCase) = .map(_.toLowerCase)
  • 24-28. Functions map and flatMap
      // Step 3 - Split lines into words
      val words = lower.flatMap(line => line.split("\\s+"))
    Using .map(_.split("\\s+")) on the RDD[String] ("hello world", "a new line",
    "hello", …, "the end") would give an RDD[Array[String]]; a flatten step would
    then yield an RDD[String] of single words ("hello", "world", "a", "new",
    "line", …). flatMap = map + flatten in one step.
    Note: flatten() is not available in Spark, only flatMap.
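The map/flatMap distinction above can be checked with plain Python lists. This is only a sketch of the semantics with list comprehensions, not Spark itself:

```python
import re

lines = ["hello world", "a new line", "hello", "the end"]

# map keeps exactly one output element per input element,
# so splitting produces a list of lists (like RDD[Array[String]])
mapped = [re.split(r"\s+", line) for line in lines]

# flatMap flattens the per-element results into one collection,
# which is what Spark's flatMap (map + flatten) produces
flat = [word for line in lines for word in re.split(r"\s+", line)]

print(mapped[0])  # ['hello', 'world']
print(flat[:4])   # ['hello', 'world', 'a', 'new']
```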
  • 29. Key-Value Pairs
      // Step 4 - Map each word to a (word, 1) pair
      val counts = words.map(word => (word, 1))
    The RDD[String] of words becomes an RDD[(String, Int)], a Pair RDD.
    Note: .map(word => (word, 1)) = .map(word => Tuple2(word, 1)),
    so RDD[(String, Int)] = RDD[Tuple2[String, Int]].
  • 32-34. Shuffling
      // Step 5 - Count all words
      val freq = counts.reduceByKey(_ + _)
    With .groupByKey, the RDD[(String, Int)] of (word, 1) pairs first becomes an
    RDD[(String, Iterator[Int])] (e.g. hello -> (1, 1)), and
    .mapValues(_.reduce((a, b) => a + b)) then produces the RDD[(String, Int)]
    of counts (hello -> 2, world -> 1, …).
    .reduceByKey((a, b) => a + b) achieves the same result in a single step.
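The groupByKey/reduceByKey difference can be emulated with dictionaries: groupByKey materializes every value for a key before reducing, while reduceByKey folds each value into a running total as it arrives (which is why Spark can combine before the shuffle). A plain-Python sketch of the semantics, not Spark code:

```python
from collections import defaultdict
from functools import reduce

pairs = [("hello", 1), ("a", 1), ("world", 1),
         ("new", 1), ("line", 1), ("hello", 1)]

# groupByKey-style: keep every value per key, then reduce the groups
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)           # all six 1s are kept in memory
grouped_counts = {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}

# reduceByKey-style: fold each value into a partial sum immediately,
# so only one number per key is ever kept
counts = defaultdict(int)
for k, v in pairs:
    counts[k] += v

print(dict(counts))   # {'hello': 2, 'a': 1, 'world': 1, 'new': 1, 'line': 1}
```

Both paths give the same counts; the difference is how much intermediate data exists, which is what makes reduceByKey cheaper across a real shuffle.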
  • 35. Top N (Prepare data)
      // Step 6 - Swap tuples (partial code)
      freq.map(_.swap)
    The RDD[(String, Int)] of (word, count) pairs becomes an RDD[(Int, String)]
    of (count, word) pairs.
  • 36-38. Top N (First Attempt)
    .sortByKey sorts the RDD[(Int, String)] by count (use sortByKey(false) for
    descending order), and .take(N) then returns an Array[(Int, String)] with
    the top results, e.g. (2, "hello"), (1, "world").
  • 39. Top N
      // Step 6 - Swap tuples (complete code)
      val top = freq.map(_.swap).top(N)
    Each partition computes a local top N (*), and a reduction then merges the
    per-partition results into the final Array[(Int, String)].
    * local top N implemented by bounded priority queues
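The top(N) strategy on this slide (a bounded local top N per partition, then a small reduction) can be sketched with heapq.nlargest standing in for the bounded priority queues. The partition layout below is made up for illustration:

```python
import heapq

# freq pairs already swapped to (count, word), split across two partitions
partitions = [
    [(2, "hello"), (1, "world"), (1, "a")],
    [(1, "new"), (1, "line")],
]

N = 2

# Each partition keeps only its own top N in a bounded structure...
local_tops = [heapq.nlargest(N, part) for part in partitions]

# ...so the reduction only has to merge N items per partition,
# never the full data set
top = heapq.nlargest(N, [item for local in local_tops for item in local])

print(top)  # [(2, 'hello'), (1, 'world')]
```

This is why top(N) avoids the full sortByKey shuffle of the first attempt: only N * numPartitions items ever reach the final merge.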
  • 40. Top Words by Frequency (Full Code)
      val spark = new SparkContext()

      // RDD creation from external data source
      val docs = spark.textFile("hdfs://docs/")

      // Split lines into words
      val lower = docs.map(line => line.toLowerCase)
      val words = lower.flatMap(line => line.split("\\s+"))
      val counts = words.map(word => (word, 1))

      // Count all words (automatic combination)
      val freq = counts.reduceByKey(_ + _)

      // Swap tuples and get top results
      val top = freq.map(_.swap).top(N)

      top.foreach(println)
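The same pipeline can be traced end to end with plain Python collections; each line below mirrors one step of the slide's dataflow (the real Python API would need a SparkContext, so this is only a local sketch with made-up sample data):

```python
import re
from collections import Counter

docs = ["Hello World", "A New Line", "hello", "The End"]

lower = [line.lower() for line in docs]                        # map
words = [w for line in lower for w in re.split(r"\s+", line)]  # flatMap
freq = Counter(words)                                          # map + reduceByKey
# swap to (count, word) and keep the top N
top = sorted(((n, w) for w, n in freq.items()), reverse=True)[:3]

for pair in top:
    print(pair)   # (2, 'hello') first, then the count-1 words
```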
  • 42. SchemaRDD
    A SchemaRDD is an RDD of Row plus Column Metadata.
    Queries with SQL. Support for Reflection, JSON, Parquet, …
  • 43. SchemaRDD
      case class Word(text: String, n: Int)

      val wordsFreq = freq.map {
        case (text, count) => Word(text, count)
      } // RDD[Word]

      wordsFreq.registerTempTable("wordsFreq")

      val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row]

      topWords.collect().foreach(println)
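The query on this slide behaves like ordinary SQL over a (text, n) table. The sqlite3 sketch below is only an analogy for the query semantics over toy data, not Spark SQL:

```python
import sqlite3

# In-memory table shaped like the registered wordsFreq temp table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordsFreq (text TEXT, n INTEGER)")
conn.executemany("INSERT INTO wordsFreq VALUES (?, ?)",
                 [("hello", 2), ("world", 1), ("line", 3)])

# Same shape as the slide's query, with limit 2 for the toy data
topWords = conn.execute(
    "select text, n from wordsFreq order by n desc limit 2").fetchall()
print(topWords)  # [('line', 3), ('hello', 2)]
```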
  • 44. RDD Lineage
      words = sc.textFile("hdfs://large/file/")   // HadoopRDD
                .map(_.toLowerCase)               // MappedRDD
                .flatMap(_.split(" "))            // FlatMappedRDD
      nums  = words.filter(_.matches("[0-9]+"))   // FilteredRDD
      alpha = words.filter(_.matches("[a-z]+"))   // FilteredRDD
      alpha.count()                               // Action (run job on the cluster)
    The lineage is built on the driver by the transformations; only an action
    runs the job on the cluster.
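The lineage idea on this slide (transformations only record how to compute; an action triggers the run) can be mimicked with Python generators, which are likewise lazy until consumed. A hypothetical sketch, with each helper standing in for one RDD type:

```python
import re

def text_file(lines):
    # 'HadoopRDD': just a recipe for producing records
    def compute():
        yield from lines
    return compute

def mapped(parent, f):
    # 'MappedRDD': remembers its parent and the function; nothing runs yet
    def compute():
        return (f(x) for x in parent())
    return compute

def flat_mapped(parent, f):
    # 'FlatMappedRDD': one input record may produce many output records
    def compute():
        return (y for x in parent() for y in f(x))
    return compute

def filtered(parent, pred):
    # 'FilteredRDD': keeps only records matching the predicate
    def compute():
        return (x for x in parent() if pred(x))
    return compute

# Build the lineage: no data moves while these lines execute
lines = text_file(["Hello World", "ABC 123"])
lower = mapped(lines, str.lower)
words = flat_mapped(lower, lambda s: re.split(r"\s+", s))
alpha = filtered(words, lambda w: re.fullmatch(r"[a-z]+", w))

# 'Action': only now does the whole chain actually run
count = sum(1 for _ in alpha())
print(count)  # hello, world, abc -> 3
```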
  • 45. Deployment with Hadoop
    [diagram: a client submits the application (mode=cluster); the Spark Master
    allocates resources (cores and memory); the Driver and Executors run on
    Spark Workers co-located with HDFS Data Nodes; the Name Node tracks the
    blocks A, B, C, D of /large/file, replicated with RF 3 across Data Nodes 1-4]