An Overview of Apache Spark
Find my presentation and other related resources here:
http://events.mapr.com/JacksonvilleJava
(you can find this link in the event's page at meetup.com)
• Today's Presentation
• Whiteboard & demo videos
• Free On-Demand Training
• Free eBooks
• Free Hadoop Sandbox
• And more…

Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Examples and Resources
MapReduce Refresher

MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• A parallel and distributed algorithm providing:
  – Data Locality
  – Fault Tolerance
  – Linear Scalability

The Hadoop Strategy
http://developer.yahoo.com/hadoop/tutorial/module4.html
• Distribute data (share nothing)
• Distribute computation (parallelization without synchronization)
• Tolerate failures (no single point of failure)
[Diagram: mapping processes on Nodes 1–3 feed reducing processes on Nodes 1–3]

Distribute Data: HDFS
• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds the location metadata; a user process asks it where the data lives, then accesses the physical data directly over the network
• DataNodes store & retrieve the data

Distribute Computation
[Diagram: a MapReduce program reads from the data sources, runs on the Hadoop cluster, and produces the result]

MapReduce Execution and Data Flow
• Map
  – loads data and defines key/value pairs
• Shuffle
  – sorts and collects the data by key
• Reduce
  – receives (key, list) pairs to process and output
[Diagram: on each node, files loaded from the HDFS store are broken into Splits, and RecordReaders (RR) turn them into input (k, v) pairs for the map tasks; Partitioners exchange intermediate (k, v) pairs between all nodes in the "shuffle"; each node then sorts, reduces, and writes the final (k, v) pairs back to the local HDFS store]

MapReduce Example: Word Count

Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax

Map output:
the, 1
time, 1
has, 1
come, 1
…
and, 1
…
and, 1
…

Shuffle and Sort output:
and, [1, 1, 1]
come, [1, 1, 1]
has, [1, 1]
the, [1, 1, 1]
time, [1, 1, 1, 1]
…

Reduce output:
and, 3
come, 3
has, 2
the, 3
time, 4
…

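The map and reduce steps above are written against Hadoop's MapReduce API. A minimal sketch of the two classes in Scala (class names are illustrative, and the Job driver setup is omitted):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map: emit (word, 1) for every token in the input line
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    for (token <- value.toString.split("\\s+") if token.nonEmpty) {
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reduce: sum the 1s gathered for each word by the shuffle
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
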
Tolerate Failures
• Failures are expected and managed gracefully
• DataNode fails -> the NameNode will locate a replica
• MapReduce task fails -> the JobTracker will schedule another one

MapReduce Processing Model
• For complex work, chain jobs together
  – Use a higher-level language or DSL that does this for you

Typical MapReduce Workflows
[Diagram: input to Job 1 -> Maps/Reduces -> output to a SequenceFile -> input to Job 2 -> Maps/Reduces -> … -> input to the last job -> Maps/Reduces -> output from the last job, all passing through HDFS]
Iteration is slow because each job writes/reads its data to disk

Iterations
Step -> Step -> Step -> Step -> Step
In-memory Caching
• Data partitions are read from RAM instead of disk

MapReduce Design Patterns
• Summarization
  – Inverted index, counting
• Filtering
  – Top ten, distinct
• Aggregation
• Data Organization
  – Partitioning
• Join
  – Join data sets
• Metapattern
  – Job chaining

Inverted Index Example

Input:
alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it

Map output:
time, alice.txt
has, alice.txt
come, alice.txt
…
tis, macbeth.txt
time, macbeth.txt
do, macbeth.txt
…

Reduce output:
come, (alice.txt)
do, (macbeth.txt)
has, (alice.txt)
time, (alice.txt, macbeth.txt)
. . .

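Looking ahead: the Spark sections later in this deck make the same inverted index a few lines of code. A sketch for comparison (the input directory name is a placeholder):

// Inverted index in Spark: word -> the files it appears in
val index = sc.wholeTextFiles("inputDir")            // RDD of (fileName, fileContents)
  .flatMap { case (file, text) =>
    text.split("\\s+").map(word => (word, file))     // one (word, file) pair per token
  }
  .distinct()                                        // drop duplicate (word, file) pairs
  .groupByKey()                                      // word -> list of files
index.collect().foreach(println)
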
MapReduce: The Good
• Optimized to read and write large data files quickly
  – Processes data on the node where it is stored
  – Minimizes movement of the disk head
• Scalable
• Built-in fault tolerance
• Developer focuses on Map/Reduce, not infrastructure
• A simple(?) API

MapReduce: The Bad
• Optimized for disk IO
  – Doesn't leverage memory well
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Simple abstraction
  – Key/Value in, Key/Value out

Free Hadoop MapReduce On-Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

What is Hive?
• A data warehouse on top of Hadoop
  – Analytical querying of data without programming
• SQL-like execution for Hadoop
  – SQL evaluates to MapReduce code
  – Submits jobs to your cluster

Hive
[Diagram: Hive clients (Command Line Interface, Web Interface, and JDBC/ODBC via the Thrift Server) talk to the Driver (compiler, optimizer, executor) and the Metastore; jobs are submitted to Hadoop (MapReduce + HDFS): the JobTracker and NameNode coordinate DataNodes running TaskTrackers]
• The schema metadata is stored in the Hive metastore
• A Hive table definition can map onto an HBase table (here, trades_tall)

Hive HBase – External Table

key             cf1:price  cf1:vol
AMZN_986186008  12.34      1000
AMZN_986186007  12.00      50

SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%" ;

• Selection: WHERE key LIKE …
• Projection: SELECT price
• Aggregation: AVG(price)

Hive HBase – Hive Query

SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%" ;

[Diagram: Queries -> Parser -> Planner -> Execution -> HBase Tables]

Hive Query Plan
• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
          Filter Operator
            predicate: (key like 'GOOG%') (type: boolean)
          Select Operator
            Group By Operator
      Reduce Operator Tree:
        Group By Operator
          Select Operator
            File Output Operator

Hive Query Plan – (2)

hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

[Diagram: scan the Trades table -> filter (key like 'GOOG%') -> select price -> group by with aggregation avg(price); executed as map() tasks feeding reduce() tasks; output column col0]

What is a Directed Acyclic Graph (DAG)?
• Graph
  – vertices (points) and edges (lines)
• Directed
  – edges go in a single direction
• Acyclic
  – no looping
[Diagram: A -> B]

Hive Query Plan Map Reduce Execution
[Diagram: an operator tree (table scans of t1 through RS1/RS3, AGG1, JOIN1, RS2/RS4, AGG2, and FS1) initially runs as three MapReduce jobs; after optimization, the same tree collapses into a single job]

Free HBase On-Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

Apache Spark

Unified Platform
[Stack diagram: Spark SQL, Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) run on Spark, the general execution engine; Spark runs on cluster managers (Mesos, Hadoop YARN) over a distributed file system (HDFS, MapR-FS, S3, …)]

Why Apache Spark?
• Distributed parallel cluster computing
  – Scalable, fault tolerant
• Programming framework
  – APIs in Java, Scala, Python
  – Less code than MapReduce
• Fast
  – Runs computations in memory
  – Tasks are threads

Spark Use Cases
• Iterative algorithms on large amounts of data
• Some example use cases:
  – Anomaly detection
  – Classification
  – Predictions
  – Recommendations

Why Iterative Algorithms?
• Algorithms that need iterations
  – Clustering (K-Means, Canopy, …)
  – Gradient descent (e.g., Logistic Regression, Matrix Factorization)
  – Graph algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …)
  – Alternating Least Squares (ALS)
  – Graph communities / dense sub-components
  – Inference (belief propagation)
  – …

Data Sources
• Local files
• S3
• Hadoop Distributed File System (HDFS)
  – any Hadoop InputFormat
• HBase
• Other NoSQL data stores

How Spark Works

Spark Programming Model

val sc = new SparkContext(...)
val rdd = sc.textFile("hdfs://…")
rdd.map(...)

[Diagram: the Driver Program holds the SparkContext and distributes Tasks to Worker Nodes in the cluster]

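Filling that in, a minimal, self-contained driver sketch in Scala (the app name, master URL, and file path are illustrative placeholders, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver builds a SparkContext, which connects to the cluster manager
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations are declared on the driver; executors run them as tasks
    val lineLengths = sc.textFile("input.txt").map(_.length)
    println(lineLengths.count())

    sc.stop()
  }
}
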
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs:
• read-only collection of elements
• operated on in parallel
• cached in memory
  – or on disk
• fault tolerant

Working With RDDs

RDD -> Transformations -> RDD -> Action -> Value

Creating an RDD:
linesRDD = sc.textFile("LogFile.txt")

Transformations define a new RDD from an existing one:
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)

Actions return a value to the driver:
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # first ERROR line

Example Spark Word Count in Scala

Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax

// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words
  .map(word => (word, 1))
  .reduceByKey{ case (x, y) => x + y }
// Save the word count back out to a text file.
counts.saveAsTextFile(outputFile)

Output: the, 20   time, 4   and, 12   …

Example Spark Word Count in Scala

// Load input data.
val input = sc.textFile(inputFile)

Lineage so far: textFile creates a HadoopRDD and a MapPartitionsRDD over its partitions

Example Spark Word Count in Scala

// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))

Lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD

FlatMap

flatMap(line => line.split(" ")) is a 1-to-many mapping:

"Ships and wax" -> Ships, and, wax

Result: RDD[String] wordsRDD

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))

Lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD

Map

map(word => (word, 1)) is a 1-to-1 mapping:

Ships -> (Ships, 1)
and   -> (and, 1)
wax   -> (wax, 1)

Result: RDD[(String, Int)] word1s (a pair RDD)

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{ case (x, y) => x + y }

Lineage so far: … map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD

reduceByKey

reduceByKey{ case (x, y) => x + y } merges the values for each key with the given function:

(and, 1), (and, 1) -> (and, 2)      e.g., the two counts combine: 1 + 1
(wax, 1)           -> (wax, 1)

PairRDD function signature:
reduceByKey(func: (V, V) => V): RDD[(K, V)]

Result: RDD[(String, Int)] counts

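A quick way to see this behavior in the shell; a sketch using the same illustrative pairs:

// A pair RDD with a duplicate key
val pairs = sc.parallelize(Seq(("and", 1), ("wax", 1), ("and", 1)))
// (_ + _) is shorthand for { case (x, y) => x + y }
pairs.reduceByKey(_ + _).collect()   // Array((and,2), (wax,1)) – order may vary
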
Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{ case (x, y) => x + y }
val countArray = counts.collect()

Full lineage: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD; collect -> Array

Demo: Interactive Shell
• Iterative development
  – Cache those RDDs
  – Open the shell and ask questions
• We have all wished we could do this with MapReduce
  – Compile/save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark

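A taste of that workflow; a sketch (the log file path is a placeholder):

// In spark-shell, sc is already defined:
val lines = sc.textFile("LogFile.txt").cache()   // cache for repeated questions
lines.filter(_.contains("ERROR")).count()        // first question
lines.filter(_.contains("WARN")).count()         // follow-ups re-use the cached RDD
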
Components Of Execution

Spark RDD DAG -> Physical Execution Plan

RDD Graph (word count):
sc.textFile(…) -> HadoopRDD -> MapPartitionsRDD
flatMap -> MapPartitionsRDD
map -> MapPartitionsRDD
reduceByKey -> ShuffledRDD -> MapPartitionsRDD
collect

Physical Plan: Stage 1 covers everything up to the shuffle; Stage 2 covers the rest

Physical Execution Plan -> Stages and Tasks
• The physical plan DAG is split into stages (Stage 1, Stage 2), and each stage is split into tasks
[Diagram: the Task Scheduler submits each stage's Task Set; on a Worker Node, an Executor runs each task in a thread against a cached partition, reading blocks (HFiles) from the local HDFS Data Node]

Summary of Components
• Task: unit of execution
• Stage: group of tasks
  – based on the partitions of the RDD
  – tasks run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions

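You can inspect these pieces yourself: an RDD prints its lineage, with indentation marking the stage (shuffle) boundaries, via toDebugString. A sketch (the input path is a placeholder):

val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// Prints ShuffledRDD <- MapPartitionsRDD <- ... back to the HadoopRDD
println(counts.toDebugString)
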
How a Spark Application Runs on a Hadoop Cluster

val sc = new SparkContext(...)
val rdd = sc.textFile("hdfs://…")
rdd.map(...)

[Diagram: the Driver Program on the client node creates the SparkContext, which requests resources from the YARN Resource Manager (coordinated with ZooKeeper); a YARN Node Manager on each Worker Node launches an Executor; each Executor runs tasks in threads against cached partitions, reading blocks from the local HDFS Data Node]

Deploying Spark – Cluster Manager Types
• Standalone mode
• Mesos
• YARN
• EC2
• GCE
RDD Transformations and Actions

RDD -> Transformations -> RDD -> Action -> Value

Transformations (define a new RDD):
map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a value to the driver):
reduce, collect, count, save, lookup(key), …

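Transformations are lazy: they only record lineage, and nothing executes until an action runs. A small sketch:

val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)         // transformation: nothing runs yet
val bigOnes = doubled.filter(_ > 10)  // still nothing runs
println(bigOnes.count())              // action: triggers the computation -> 5
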
MapR Tutorial: Getting Started with Spark on the MapR Sandbox
• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial

MapR Blog: Getting Started with the Spark Web UI
• https://www.mapr.com/blog/getting-started-spark-web-ui

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.
(Based on slides from Pat McDonough at …)

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()           # Action

For this first count, on the cluster:
• The driver ships tasks to the three workers
• Each worker reads its HDFS block (Block 1, 2, 3)
• Each worker processes its partition, caches it (Cache 1, 2, 3), and returns results to the driver

messages.filter(lambda s: "php" in s).count()

For the second count:
• The driver again ships tasks to the workers
• Each worker processes its partition straight from cache and returns results to the driver

Cache your data -> Faster results

DataFrames

Scenario: Online Auction Data

AuctionID  | Bid   | Bid Time | Bidder         | Bidder Rate | Open Bid | Price | Item | Days To Live
8213034705 | 95    | 2.927373 | jake7870       | 0           | 95       | 117.5 | xbox | 3
8213034705 | 115   | 2.943484 | davidbresler2  | 1           | 95       | 117.5 | xbox | 3
8213034705 | 100   | 2.951285 | gladimacowgirl | 58          | 95       | 117.5 | xbox | 3
8213034705 | 117.5 | 2.998947 | daysrus        | 95          | 95       | 117.5 | xbox | 3

Creating DataFrames

// Define the schema using a case class
case class Auction(auctionid: String, bid: Float,
  bidtime: Float, bidder: String, bidderrate: Int,
  openbid: Float, price: Float, item: String,
  daystolive: Int)

Creating DataFrames

// Create the RDD
val auctionRDD = sc.textFile("/user/user01/data/ebay.csv")
  .map(_.split(","))

// Map the data to the Auction class
val auctions = auctionRDD.map(a => Auction(a(0),
  a(1).toFloat, a(2).toFloat, a(3), a(4).toInt,
  a(5).toFloat, a(6).toFloat, a(7), a(8).toInt))

Creating DataFrames

// Convert the RDD to a DataFrame (needs: import sqlContext.implicits._)
val auctionsDF = auctions.toDF()

// To see the data
auctionsDF.show()

auctionid  bid   bidtime  bidder        bidderrate openbid price item daystolive
8213034705 95.0  2.927373 jake7870      0          95.0    117.5 xbox 3
8213034705 115.0 2.943484 davidbresler2 1          95.0    117.5 xbox 3
…

Inspecting Data: Examples
//To see the schema
auctionsDF.printSchema()
root
|-- auctionid: string (nullable = true)
|-- bid: float (nullable = false)
|-- bidtime: float (nullable = false)
|-- bidder: string (nullable = true)
|-- bidderrate: integer (nullable = true)
|-- openbid: float (nullable = false)
|-- price: float (nullable = false)
|-- item: string (nullable = true)
|-- daystolive: integer (nullable = true)
Inspecting Data: Examples

// Total number of bids
val totbids = auctionsDF.count()

// How many bids per item?
auctionsDF.groupBy("auctionid", "item").count.show

auctionid  item    count
3016429446 palm    10
8211851222 xbox    28
3014480955 palm    12
8214279576 xbox    4
3014844871 palm    18
3014012355 palm    35
1641457876 cartier 2

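Not on the original slides: the DataFrame API can also filter and sort without SQL; a sketch against the schema above (the query itself is illustrative):

import org.apache.spark.sql.functions.desc

// Bids over 100, highest first
auctionsDF.filter(auctionsDF("bid") > 100)
  .select("item", "bid")
  .orderBy(desc("bid"))
  .show(5)
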
Querying DataFrames with SQL

// Register the DF as a table
auctionsDF.registerTempTable("auctions")

val results = sqlContext.sql("SELECT auctionid, MAX(price)
  as maxprice FROM auctions GROUP BY item, auctionid")
results.show()

auctionid  maxprice
3019326300 207.5
8213060420 120.0
. . .

The Physical Plan for DataFrames

DataFrame Execution Plan
// Print the physical plan to the console
auctionsDF.select("auctionid").distinct.explain()

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,
     bidderrate#4,openbid#5,price#6,item#7,daystolive#8],
     MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data

Machine Learning

Collaborative Filtering with Spark
• Recommend items (filtering)
• Based on user preference data (collaborative)

Alternating Least Squares
• Approximates the sparse user-item rating matrix as the product of two dense matrices: the User and Item factor matrices

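In symbols (a standard formulation of the objective; the notation is ours, not from the slide): with observed ratings r_ui, user factors u_u, and item factors v_i, ALS minimizes

\min_{U,V} \sum_{(u,i)\,\mathrm{observed}} \left( r_{ui} - u_u^{\top} v_i \right)^2 + \lambda \left( \sum_u \lVert u_u \rVert^2 + \sum_i \lVert v_i \rVert^2 \right)

by alternating: hold V fixed and solve a least-squares problem for each u_u, then hold U fixed and solve for each v_i.
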
Parse Input

// needs: import org.apache.spark.mllib.recommendation.Rating
// parse input lines of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt,
    fields(2).toDouble)
}

// create an RDD of Rating objects
val ratingsRDD = ratingText.map(parseRating).cache()

ML Cross-Validation Process
[Diagram: the data is split into a Training Set and a Test Set; a model is trained/built on the training set and then tested against the test set, in a train/test loop]

Create Model

// Randomly split the ratings RDD into a training data RDD (80%) and a test data RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()

// Build an ALS user-product matrix model with rank=20, iterations=10
// (needs: import org.apache.spark.mllib.recommendation.ALS)
val model = (new
  ALS().setRank(20).setIterations(10).run(trainingRatingsRDD))

Get predictions
// Get the top 5 movie predictions for user 4169
val topRecsForUser = model.recommendProducts(4169, 5)
// get (user,product) pairs
val testUserProductRDD = testRatingsRDD.map {
case Rating(user, product, rating) => (user, product)
}
// get predictions for (user,product) pairs
val predictionsForTestRDD = model.predict(testUserProductRDD)
Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map{
case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map{
case Rating(user, product, rating) => ((user, product), rating)
}
//Join the test with predictions
val testAndPredictionsJoinedRDD =
testKeyedByUserProductRDD.join(predictionsKeyedByUserProductRDD)
Test Model

val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    (ratingT <= 1 && ratingP >= 4)
}
falsePositives.take(2)
// Array[((Int, Int), (Double, Double))] =
//   ((3842,2858),(1.0,4.106488210964762)),
//   ((6031,3194),(1.0,4.790778049100913))

Test Model

// Evaluate the model using Mean Absolute Error (MAE) between test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = (testRating - predRating)
    Math.abs(err)
}.mean()
// meanAbsoluteError: Double = 0.7244940545944053

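MLlib can also compute this for you; a sketch using RegressionMetrics (an alternative to the hand-rolled MAE above, not shown on the original slides):

import org.apache.spark.mllib.evaluation.RegressionMetrics

// RegressionMetrics expects (prediction, observation) pairs
val predsAndRatings = testAndPredictionsJoinedRDD.map {
  case (_, (testRating, predRating)) => (predRating, testRating)
}
val metrics = new RegressionMetrics(predsAndRatings)
println(metrics.meanAbsoluteError)    // should match the value computed above
println(metrics.rootMeanSquaredError) // RMSE, another common measure
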
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
Machine Learning Blog
• https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark

Examples and Resources

Coming Soon: Free Spark On-Demand Training
• https://www.mapr.com/services/mapr-academy/
Find my presentation and other related resources here:
http://events.mapr.com/JacksonvilleJava
(you can find this link in the event's page at meetup.com)
• Today's Presentation
• Whiteboard & demo videos
• Free On-Demand Training
• Free eBooks
• Free Hadoop Sandbox
• And more…

Spark on MapR
• Certified Spark distribution
• Fully supported and packaged by MapR in partnership with Databricks
  – mapr-spark package with Spark, Shark, Spark Streaming today
  – Spark-python, GraphX, and MLlib soon
• YARN integration
  – Spark can then allocate resources from the cluster when needed

References
• Spark web site: http://spark.apache.org/
• Databricks: https://databricks.com/
• Spark on MapR: http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark – O'Reilly book
• Apache Spark

Q&A
Engage with us!
@mapr   maprtech
kbotzum@mapr.com
MapR   maprtech   mapr-technologies
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 

Introduction to Spark

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies An Overview of Apache Spark
  • 2. © 2014 MapR Technologies 2 Find my presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link in the event’s page at meetup.com) Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free eBooks Free Hadoop Sandbox And more…
  • 3. © 2014 MapR Technologies 3 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Examples and Resources
  • 4. © 2014 MapR Technologies 4© 2014 MapR Technologies MapReduce Refresher
  • 5. © 2014 MapR Technologies 5 MapReduce: A Programming Model • MapReduce: Simplified Data Processing on Large Clusters (published 2004) • Parallel and Distributed Algorithm: • Data Locality • Fault Tolerance • Linear Scalability
  • 6. © 2014 MapR Technologies 6 The Hadoop Strategy http://developer.yahoo.com/hadoop/tutorial/module4.html Distribute data (share nothing) Distribute computation (parallelization without synchronization) Tolerate failures (no single point of failure) Node 1 Mapping process Node 2 Mapping process Node 3 Mapping process Node 1 Reducing process Node 2 Reducing process Node 3 Reducing process
  • 7. © 2014 MapR Technologies 7 Chunks are replicated across the cluster Distribute Data: HDFS User process NameNode . . . network HDFS splits large data files into chunks (64 MB) metadata access physical data access Location metadata DataNodes store & retrieve data data
  • 8. © 2014 MapR Technologies 8 Distribute Computation MapReduce Program Data Sources Hadoop Cluster Result
  • 9. © 2014 MapR Technologies 9 MapReduce Execution and Data Flow • Map – Loading data and defining a key values • Shuffle – Sorts,collects key-based data • Reduce – Receives (key, list)’s to process and output Files loaded from HDFS stores file file Files loaded from HDFS stores Node 1 InputFormat InputFormat OutputFormat OutputFormat Final (k, v) pairs Final (k, v) pairs reduce reduce (sort) (sort) Input (k, v) pairs map map map RR RR RR RecordReaders: Split Split Split Writeback to Local HDFS store file Writeback to Local HDFS store file SplitSplitSplit RRRRRR RecordReaders: Input (k, v) pairs mapmapmap Node2 “Shuffle” process Intermediate (k, v) Pairs exchanged By all nodes Partitioner Intermediate (k, v) pairs Partitioner Intermediate (k, v) pairs
  • 10. © 2014 MapR Technologies 10 MapReduce Example: Word Count Output "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax the, 1 time, 1 has, 1 come, 1 … and, 1 … and, 1 … and, [1, 1, 1] come, [1,1,1] has, [1,1] the, [1,1,1] time, [1,1,1,1] … and, 3 come, 3 has, 2 the, 3 time, 4 … Input Map Shuffle and Sort Reduce Output Reduce
  • 11. © 2014 MapR Technologies 11 Tolerate Failures Hadoop Cluster Failures are expected & managed gracefully DataNode fails -> name node will locate replica MapReduce task fails -> job tracker will schedule another one Data
  • 12. © 2014 MapR Technologies 12 MapReduce Processing Model • For complex work, chain jobs together – Use a higher level language or DSL that does this for you
  • 13. © 2014 MapR Technologies 13 Typical MapReduce Workflows Input to Job 1 SequenceFile Last Job Maps Reduces SequenceFile Job 1 Maps Reduces SequenceFile Job 2 Maps Reduces Output from Job 1 Output from Job 2 Input to last job Output from last job HDFS Iteration is slow because it writes/reads data to disk
  • 14. © 2014 MapR Technologies 14 Iterations Step Step Step Step Step In-memory Caching • Data Partitions read from RAM instead of disk
  • 15. © 2014 MapR Technologies 15 MapReduce Design Patterns • Summarization – Inverted index, counting • Filtering – Top ten, distinct • Aggregation • Data Organization – partitioning • Join – Join data sets • Metapattern – Job chaining
  • 16. © 2014 MapR Technologies 16 Inverted Index Example come, (alice.txt) do, (macbeth.txt) has, (alice.txt) time, (alice.txt, macbeth.txt) . . . "The time has come," the Walrus said alice.txt tis time to do it macbeth.txt time, alice.txt has, alice.txt come, alice.txt .. tis, macbeth.txt time, macbeth.txt do, macbeth.txt …
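The same inverted index maps naturally onto Spark, which the rest of this deck turns to. A minimal Scala sketch, not from the slides; the file names and whitespace tokenization are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations on older Spark 1.x

// Hypothetical sketch: build an inverted index (word -> files containing it).
object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InvertedIndex"))
    val files = Seq("alice.txt", "macbeth.txt")  // placeholder inputs

    val index = files
      .map(f => sc.textFile(f).flatMap(_.split("\\s+")).map(word => (word, f)))
      .reduce(_ union _)   // one RDD of (word, file) pairs
      .distinct()          // list a file at most once per word
      .groupByKey()        // word -> all files containing it
      .mapValues(_.toSeq.sorted)

    index.take(10).foreach(println)  // e.g. (time,List(alice.txt, macbeth.txt))
    sc.stop()
  }
}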
  • 17. © 2014 MapR Technologies 20 MapReduce: The Good • Optimized to read and write large data files quickly – Process data on the node – Minimize movement of the disk head • Scalable • Built-in fault tolerance • Developer focuses on Map/Reduce, not infrastructure • simple? API
  • 18. © 2014 MapR Technologies 21 MapReduce: The Bad • Optimized for disk IO – Doesn’t leverage memory well – Iterative algorithms go through disk IO path again and again • Primitive API – simple abstraction – Key/Value in/out
  • 19. © 2014 MapR Technologies 22 Free Hadoop MapReduce On Demand Training • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  • 20. © 2014 MapR Technologies 23 What is Hive? • Data warehouse on top of Hadoop – analytical querying of data • without programming • SQL-like execution for Hadoop – SQL evaluates to MapReduce code • Submits jobs to your cluster
  • 21. © 2014 MapR Technologies 24 Job Tracker Name Node HADOOP (MAP-REDUCE + HDFS) Data Node + Task Tracker Hive Metastore Driver (compiler, Optimizer, Executor) Command Line Interface Web Interface JDBC Thrift Server ODBC Metastore Hive The schema metadata is stored in the Hive metastore Hive Table definition HBase trades_tall Table
  • 22. © 2014 MapR Technologies 26 Hive HBase – External Table key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 Selection WHERE key like SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE “AMZN” ; Projection select price Aggregation Avg( price)
  • 23. © 2014 MapR Technologies 27 Hive HBase – Hive Query SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "GOOG” ; HBase Tables Queries Parser Planner Execution
  • 24. © 2014 MapR Technologies 28 Hive Query Plan • EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator predicate: (key like 'GOOG%') (type: boolean) Select Operator Group By Operator Reduce Operator Tree: Group By Operator Select Operator File Output Operator
  • 25. © 2014 MapR Technologies 29 Hive Query Plan – (2) output hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; col0 Trades table group aggregations: avg(price) scan filter Select key like 'GOOG% Select price Group by map() map() map() reduce() reduce()
  • 26. © 2014 MapR Technologies 32 What is a Directed Acyclic Graph (DAG)? • Graph – vertices (points) and edges (lines) • Directed – edges point in a single direction • Acyclic – no looping
  • 27. © 2014 MapR Technologies 33 Hive Query Plan Map Reduce Execution FS1 AGG2 RS4 JOIN1 RS2 AGG1 RS1 t1 RS3 t1 Job 3 Job 2 FS1 AGG2 JOIN1 AGG1 RS1 t1 RS3 Job 1 Job 1 Optimize
  • 28. © 2014 MapR Technologies 35 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  • 29. © 2014 MapR Technologies 40© 2014 MapR Technologies Apache Spark
  • 30. © 2014 MapR Technologies 42 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  • 31. © 2014 MapR Technologies 43 Why Apache Spark? • Distributed parallel cluster computing – Scalable, fault tolerant • Programming framework – APIs in Java, Scala, Python – Less code than MapReduce • Fast – Runs computations in memory – Tasks are threads
  • 32. © 2014 MapR Technologies 46 Spark Use Cases • Iterative Algorithms on large amounts of data • Some Example Use Cases: – Anomaly detection – Classification – Predictions – Recommendations
  • 33. © 2014 MapR Technologies 47 Why Iterative Algorithms • Algorithms that need iterations – Clustering (K-Means, Canopy, …) – Gradient descent (e.g., Logistic Regression, Matrix Factorization) – Graph Algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …) – Alternating Least Squares (ALS) – Graph communities / dense sub-components – Inference (belief propagation) – …
  • 34. © 2014 MapR Technologies 48 Data Sources • Local Files • S3 • Hadoop Distributed Filesystem – any Hadoop InputFormat • HBase • other NoSQL data stores
  • 35. © 2014 MapR Technologies 51© 2014 MapR Technologies How Spark Works
  • 36. © 2014 MapR Technologies 52 Spark Programming Model sc=new SparkContext rDD=sc.textFile(“hdfs://…”) rDD.map Driver Program SparkContext cluster Worker Node Task Task Task Worker Node
  • 37. © 2014 MapR Technologies 53 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements
  • 38. © 2014 MapR Technologies 54 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements • operated on in parallel • Cached in memory – Or on disk • Fault tolerant
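A minimal sketch of those four properties, assuming an existing SparkContext named sc (as in spark-shell):

// Parallel: explicitly request 8 partitions.
val nums = sc.parallelize(1 to 1000000, 8)

// Read-only: transformations never mutate an RDD, they derive a new one.
val squares = nums.map(n => n.toLong * n)

// Cached in memory, or spilled to disk if it does not fit.
import org.apache.spark.storage.StorageLevel
squares.persist(StorageLevel.MEMORY_AND_DISK)

squares.count()  // first action computes and caches the partitions
squares.sum()    // second action is served from the cache
// Fault tolerant: a lost partition is recomputed from its lineage, not restored from a replica.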
  • 39. © 2014 MapR Technologies 55 Working With RDDs RDD textFile = sc.textFile(”SomeFile.txt”)
  • 40. © 2014 MapR Technologies 56 Working With RDDs RDD RDD RDD RDD Transformations linesWithErrorRDD = linesRDD.filter(lambda line: “ERROR” in line) linesRDD = sc.textFile(”LogFile.txt”)
  • 41. © 2014 MapR Technologies 57 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithErrorRDD.count() 6 linesWithErrorRDD.first() # Error line linesRDD = sc.textFile(”LogFile.txt”) linesWithErrorRDD = linesRDD.filter(lambda line: “ERROR” in line)
  • 42. © 2014 MapR Technologies 58 Example Spark Word Count in Scala ...the ... "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax and time and the, 1 time, 1 and, 1 and, 1 and, 12 time, 4 ... the, 20 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) // Transform into pairs and count. val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} // Save the word count back out to a text file. counts.saveAsTextFile(outputFile) the, 20 time, 4 ….. and, 12 .........
  • 43. © 2014 MapR Technologies 59 Example Spark Word Count in Scala HadoopRDD textFile // Load input data. val input = sc.textFile(inputFile) RDD partitions MapPartitionsRDD
  • 44. © 2014 MapR Technologies 60 Example Spark Word Count in Scala // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) HadoopRDD textFile flatmap MapPartitionsRDD MapPartitionsRDD
  • 45. © 2014 MapR Technologies 61 FlatMap flatMap(line => line.split(" ")) – 1-to-many mapping Ships Ships and wax and wax RDD<String> wordsRDD
  • 46. © 2014 MapR Technologies 62 Example Spark Word Count in Scala textFile flatmap map val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) // Transform into pairs val counts = words.map(word => (word, 1)) HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD
  • 47. © 2014 MapR Technologies 63 Map map(word => (word, 1)) – 1-to-1 mapping PairRDD<String, Integer> word1s Ships and wax Ships, 1 and, 1 wax, 1
  • 48. © 2014 MapR Technologies 64 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} HadoopRDD MapPartitionsRDD MapPartitionsRDD ShuffledRDD MapPartitionsRDD
  • 49. © 2014 MapR Technologies 65 reduceByKey reduceByKey{case (x, y) => x + y} merges values per key: and, 1 and, 1 wax, 1 → and, 2 wax, 1 PairRDD<String, Integer> counts PairRDD function: reduceByKey(func: (value, value) ⇒ value): RDD[(key, value)] case (0, 1) => 0 + 1 case (1, 1) => 1 + 1
  • 50. © 2014 MapR Technologies 66 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} val countArray = counts.collect() HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD collect ShuffledRDD Array
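Assembled from the fragments above, a self-contained word count application might look like the following; the object name and the input/output paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations on older Spark 1.x

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val input = sc.textFile("hdfs:///path/to/input.txt")  // placeholder path
    val counts = input
      .flatMap(line => line.split(" "))      // 1-to-many: line -> words
      .map(word => (word, 1))                // 1-to-1: word -> (word, 1)
      .reduceByKey { case (x, y) => x + y }  // shuffle: merge counts per key

    counts.saveAsTextFile("hdfs:///path/to/output")  // placeholder path
    sc.stop()
  }
}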
  • 51. © 2014 MapR Technologies 67 Demo Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
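For example, a hypothetical spark-shell session; the shell creates sc for you, and the input path is a placeholder:

// $ spark-shell
val lines  = sc.textFile("/path/to/input.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.cache()   // keep the result in memory between questions
counts.count()   // first action computes and caches
counts.take(5)   // follow-up questions are served from the cache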
  • 52. © 2014 MapR Technologies 68© 2014 MapR Technologies Components Of Execution
  • 53. © 2014 MapR Technologies 69 Spark RDD DAG -> Physical Execution plan HadoopRDD sc.textFile(…) MapPartitionsRDD flatmap flatmap reduceByKey RDD Graph Physical Plan collect MapPartitionsRDD ShuffledRDD MapPartitionsRDD Stage 1 Stage 2
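One way to see this mapping from the shell is RDD.toDebugString, which prints the lineage graph that the scheduler cuts into stages at shuffle boundaries. Continuing the word count example (output abbreviated and version-dependent):

println(counts.toDebugString)
// (2) ShuffledRDD[4] at reduceByKey ...        <- Stage 2
//  +-(2) MapPartitionsRDD[3] at map ...        <- Stage 1
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  HadoopRDD[0] at textFile ...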
  • 54. © 2014 MapR Technologies 70 Physical Plan DAG Stage 1 Stage 2 Task Task Task Task Task Task Task Stage 1 Stage 2 Split into Tasks HFile HDFS Data Node Worker Node block cache partition Executor HFile block HFileHFile Task thread Task Set Task Scheduler Task Physical Execution plan -> Stages and Tasks
  • 55. © 2014 MapR Technologies 71 Summary of Components • Task: unit of execution • Stage: group of tasks – Based on partitions of the RDD – Tasks run in parallel • DAG: logical graph of RDD operations • RDD: parallel dataset with partitions
  • 56. © 2014 MapR Technologies 72 How a Spark Application runs on a Hadoop cluster HFile HDFS Data Node Worker Node block cache partitiontask task Executor HFile block HFileHFile SparkContext zookeeper YARN Resource Manager HFile HDFS Data Node Worker Node block cache partitiontask task Executor HFile block HFileHFile Client node sc=new SparkContext rDD=sc.textFile(“hdfs://…”) rDD.map Driver Program YARN Node Manager YARN Node Manager
  • 57. © 2014 MapR Technologies 73 Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
  • 58. © 2014 MapR Technologies 74 RDD Transformations and Actions RDD RDD RDD RDDTransformations Action Value Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Actions (return a value) reduce collect count save lookupKey …
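A short sketch contrasting the two, assuming an existing SparkContext sc; nothing executes until the action on the last line:

// Transformations are lazy: each call only records a step in the DAG.
val words    = sc.parallelize(Seq("ships", "wax", "ships", "walrus"))
val pairs    = words.map(word => (word, 1))            // transformation
val counts   = pairs.reduceByKey(_ + _)                // transformation (shuffle)
val repeated = counts.filter { case (_, n) => n > 1 }  // transformation

// Actions trigger execution and return a value to the driver.
val result = repeated.collect()  // Array((ships,2))
println(result.mkString(", "))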
  • 59. © 2014 MapR Technologies 75 MapR Tutorial: Getting Started with Spark on MapR Sandbox • https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
  • 60. © 2014 MapR Technologies 76 MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
  • 61. © 2014 MapR Technologies 77© 2014 MapR Technologies Example: Log Mining
  • 62. © 2014 MapR Technologies 78 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Based on slides from Pat McDonough at Databricks
  • 63. © 2014 MapR Technologies 79 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver
  • 64. © 2014 MapR Technologies 80 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”)
  • 65. © 2014 MapR Technologies 81 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”) Base RDD
  • 66. © 2014 MapR Technologies 82 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver
  • 67. © 2014 MapR Technologies 83 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver Transformed RDD
  • 68. © 2014 MapR Technologies 84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count()
  • 69. © 2014 MapR Technologies 85 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Action
  • 70. © 2014 MapR Technologies 86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3
  • 71. © 2014 MapR Technologies 87 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver tasks tasks tasks
  • 72. © 2014 MapR Technologies 88 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  • 73. © 2014 MapR Technologies 89 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  • 74. © 2014 MapR Technologies 90 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results
  • 75. © 2014 MapR Technologies 91 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count()
  • 76. © 2014 MapR Technologies 92 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() tasks tasks tasks Driver
  • 77. © 2014 MapR Technologies 93 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Process from Cache Process from Cache Process from Cache
  • 78. © 2014 MapR Technologies 94 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver results results results
  • 79. © 2014 MapR Technologies 95 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“\t”)[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Cache your data → Faster Results
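For reference, the same pipeline in the deck's Scala style, as a sketch; the log path, the tab-delimited layout, and the field position are assumptions carried over from the pyspark snippet above:

val lines    = sc.textFile("hdfs:///path/to/logs")  // placeholder path
val errors   = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))         // assumes ERROR<tab>timestamp<tab>message
messages.cache()

messages.filter(_.contains("mysql")).count()  // first action: reads, filters, caches
messages.filter(_.contains("php")).count()    // later actions: served from the cache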
  • 80. © 2014 MapR Technologies 96© 2014 MapR Technologies Dataframes
  • 81. © 2014 MapR Technologies 97 Scenario: Online Auction Data AuctionID Bid Bid Time Bidder Bidder Rate Open Bid Price Item Days To Live 8213034705 95 2.927373 jake7870 0 95 117.5 xbox 3 8213034705 115 2.943484 davidbresler2 1 95 117.5 xbox 3 8213034705 100 2.951285 gladimacowgirl 58 95 117.5 xbox 3 8213034705 117.5 2.998947 daysrus 95 95 117.5 xbox 3
  • 82. © 2014 MapR Technologies 98 Creating DataFrames // Define schema using a case class case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float, item: String, daystolive: Int)
  • 83. © 2014 MapR Technologies 99 Creating DataFrames // Create the RDD val auctionRDD=sc.textFile(“/user/user01/data/ebay.csv”) .map(_.split(",")) // Map the data to the Auction class val auctions=auctionRDD.map(a=>Auction(a(0), a(1).toFloat, a(2).toFloat, a(3), a(4).toInt, a(5).toFloat, a(6).toFloat, a(7), a(8).toInt))
  • 84. © 2014 MapR Technologies 100 Creating DataFrames // Convert RDD to DataFrame val auctionsDF=auctions.toDF() //To see the data auctionsDF.show() auctionid bid bidtime bidder bidderrate openbid price item daystolive 8213034705 95.0 2.927373 jake7870 0 95.0 117.5 xbox 3 8213034705 115.0 2.943484 davidbresler2 1 95.0 117.5 xbox 3 …
  • 85. © 2014 MapR Technologies 101 Inspecting Data: Examples //To see the schema auctionsDF.printSchema() root |-- auctionid: string (nullable = true) |-- bid: float (nullable = false) |-- bidtime: float (nullable = false) |-- bidder: string (nullable = true) |-- bidderrate: integer (nullable = true) |-- openbid: float (nullable = false) |-- price: float (nullable = false) |-- item: string (nullable = true) |-- daystolive: integer (nullable = true)
  • 86. © 2014 MapR Technologies 102 Inspecting Data: Examples //Total number of bids val totbids=auctionsDF.count() //How many bids per item? auctionsDF.groupBy("auctionid", "item").count.show auctionid item count 3016429446 palm 10 8211851222 xbox 28 3014480955 palm 12 8214279576 xbox 4 3014844871 palm 18 3014012355 palm 35 1641457876 cartier 2
  • 87. © 2014 MapR Technologies 103 Querying DataFrames with SQL // Register the DF as a table auctionsDF.registerTempTable(“auction”) val results = sqlContext.sql("SELECT auctionid, MAX(price) as maxprice FROM auction GROUP BY item,auctionid") results.show() auctionid maxprice 3019326300 207.5 8213060420 120.0 . . .
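Pulling the DataFrame fragments together, one runnable sketch using the Spark 1.x API shown in the deck (the SQLContext setup is assumed; the CSV path comes from the earlier slide):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String,
                   bidderrate: Int, openbid: Float, price: Float,
                   item: String, daystolive: Int)

object AuctionApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AuctionApp"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // enables rdd.toDF()

    val auctionsDF = sc.textFile("/user/user01/data/ebay.csv")
      .map(_.split(","))
      .map(a => Auction(a(0), a(1).toFloat, a(2).toFloat, a(3), a(4).toInt,
                        a(5).toFloat, a(6).toFloat, a(7), a(8).toInt))
      .toDF()

    auctionsDF.registerTempTable("auction")
    sqlContext.sql(
      "SELECT auctionid, MAX(price) AS maxprice FROM auction GROUP BY item, auctionid"
    ).show()
  }
}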
  • 88. © 2014 MapR Technologies 104 The physical plan for DataFrames
  • 89. © 2014 MapR Technologies 105 DataFrame Execution Plan // Print the physical plan to the console auctionsDF.select("auctionid").distinct.explain() == Physical Plan == Distinct false Exchange (HashPartitioning [auctionid#0], 200) Distinct true Project [auctionid#0] PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3, bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
  • 90. © 2014 MapR Technologies 106 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
  • 91. © 2014 MapR Technologies 107© 2014 MapR Technologies Machine Learning
  • 92. © 2014 MapR Technologies 108 Collaborative Filtering with Spark • Recommend Items – (filtering) • Based on User preferences data – (collaborative)
  • 93. © 2014 MapR Technologies 109 Alternating Least Squares • approximates sparse user item rating matrix • as product of two dense matrices, User and Item factor matrices
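In standard notation (the usual regularized ALS objective; the slide states the idea but not the formula), the observed ratings r_{ui} are approximated by user factors u_u and item factors v_i chosen to minimize:

\min_{U,V} \sum_{(u,i) \in \text{observed}} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)

"Alternating" refers to fixing one factor matrix and solving a least-squares problem for the other, then swapping, until convergence.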
  • 94. © 2014 MapR Technologies 110 Parse Input // parse input UserID::MovieID::Rating def parseRating(str: String): Rating= { val fields = str.split("::") Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble) } // create an RDD of Ratings objects val ratingsRDD = ratingText.map(parseRating).cache()
  • 95. © 2014 MapR Technologies 111 ML Cross Validation Process Data Model Training/ Building Model Testing Test Set Train Test loop Training Set
  • 96. © 2014 MapR Technologies 112 Create Model // Randomly split ratings RDD into training data RDD (80%) and test data RDD (20%) val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L) val trainingRatingsRDD = splits(0).cache() val testRatingsRDD = splits(1).cache() // build an ALS user-product matrix model with rank=20, iterations=10 val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
  • 97. © 2014 MapR Technologies 113 Get predictions // Get the top 5 movie predictions for user 4169 val topRecsForUser = model.recommendProducts(4169, 5) // get (user,product) pairs val testUserProductRDD = testRatingsRDD.map { case Rating(user, product, rating) => (user, product) } // get predictions for (user,product) pairs val predictionsForTestRDD = model.predict(testUserProductRDD)
  • 98. © 2014 MapR Technologies 114 Test Model // prepare predictions for comparison val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map{ case Rating(user, product, rating) => ((user, product), rating) } // prepare test for comparison val testKeyedByUserProductRDD = testRatingsRDD.map{ case Rating(user, product, rating) => ((user, product), rating) } //Join the test with predictions val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD.join(predictionsKeyedByUserProductRDD)
  • 99. © 2014 MapR Technologies 115 Test Model val falsePositives =(testAndPredictionsJoinedRDD.filter{ case ((user, product), (ratingT, ratingP)) => (ratingT <= 1 && ratingP >=4) }) falsePositives.take(2) Array[((Int, Int), (Double, Double))] = ((3842,2858),(1.0,4.106488210964762)), ((6031,3194),(1.0,4.790778049100913))
  • 100. © 2014 MapR Technologies 116 Test Model //Evaluate the model using Mean Absolute Error (MAE) between test and predictions val meanAbsoluteError = testAndPredictionsJoinedRDD.map { case ((user, product), (testRating, predRating)) => val err = (testRating - predRating) Math.abs(err) }.mean() meanAbsoluteError: Double = 0.7244940545944053
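Root Mean Squared Error (RMSE) is the other metric commonly reported for ALS; a sketch of the same comparison (this variant is not in the slides):

// RMSE penalizes large errors more heavily than MAE.
val rootMeanSquaredError = math.sqrt(
  testAndPredictionsJoinedRDD.map {
    case ((user, product), (testRating, predRating)) =>
      val err = testRating - predRating
      err * err
  }.mean())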
  • 101. © 2014 MapR Technologies 117 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering – Spark Streaming
  • 102. © 2014 MapR Technologies 118 Machine Learning Blog • https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
  • 103. © 2014 MapR Technologies 119© 2014 MapR Technologies Examples and Resources
  • 104. © 2014 MapR Technologies 120 Coming soon Free Spark On Demand Training • https://www.mapr.com/services/mapr-academy/
  • 105. © 2014 MapR Technologies 121 Find my presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link in the event’s page at meetup.com) Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free eBooks Free Hadoop Sandbox And more…
  • 106. © 2014 MapR Technologies 122 Spark on MapR • Certified Spark Distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, Spark Streaming today – Spark-python, GraphX and MLlib soon • YARN integration – Spark can then allocate resources from the cluster when needed
  • 107. © 2014 MapR Technologies 123 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  • 108. © 2014 MapR Technologies 124 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies