This Edureka Spark Hadoop Tutorial will help you understand how to use Spark and Hadoop together. It is ideal for beginners as well as professionals who want to learn Apache Spark or brush up on their Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
What is Spark?
Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations.
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in processing time)
Figure: Real Time Processing In Spark
Spark Overview
Real-time processing: Spark is used in real-time data processing
In-memory computation: real-time computation and low latency, thanks to in-memory computation
Polyglot: can be programmed in Scala, Java, Python and R
Lazy evaluation: delays evaluation till needed (see the sketch below)
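To make lazy evaluation concrete, here is a minimal sketch in Scala (the input path input.txt is an assumption for illustration): transformations such as filter and map only record the lineage, and nothing runs until an action like reduce is called.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval").setMaster("local[2]"))

    //Transformations are only recorded here -- nothing is computed yet
    val lines = sc.textFile("input.txt") //hypothetical input file
    val nonEmpty = lines.filter(_.nonEmpty)
    val lengths = nonEmpty.map(_.length)

    //The action below triggers evaluation of the whole lineage in one pass
    val total = lengths.reduce(_ + _)
    println(s"Total characters: $total")

    sc.stop()
  }
}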
Spark Ecosystem
Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data (sketched below)
MLlib (Machine Learning): machine learning libraries built on top of Spark
GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
SparkR (R on Spark): package for the R language that enables R users to leverage Spark's power from the R shell
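As a flavour of the ecosystem, here is a minimal Spark Streaming sketch: a word count over text arriving on a socket. The host and port are assumptions; you can feed it locally with `nc -lk 9999`.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    //Process the stream in 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    //Assumed text source on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}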
Spark Features
Speed: runs up to 100x faster than Hadoop MapReduce for large-scale data processing
Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities (see the sketch below)
Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
Polyglot: can be programmed in Scala, Java, Python and R
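A minimal caching sketch (the data here is synthetic): persist keeps an RDD around across actions, and MEMORY_AND_DISK spills to disk when memory is tight.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-demo").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 1000000)

    //Keep the RDD in memory, spilling to disk if it does not fit
    nums.persist(StorageLevel.MEMORY_AND_DISK)

    //Both actions reuse the cached data instead of recomputing the lineage
    println(nums.sum())
    println(nums.max())

    sc.stop()
  }
}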
Spark Use Cases
Twitter Sentiment Analysis with Spark: trending topics can be used to create campaigns and attract a larger audience, while sentiment helps in crisis management, service adjustment and target marketing
NYSE: real-time analysis of stock market data
Banking: credit card fraud detection
Genomic sequencing
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion. A Hadoop cluster consists of a master node and slave nodes, and rests on two components:
HDFS (Storage): allows us to dump any kind of data across the cluster
MapReduce (Processing): allows parallel processing of the data stored in HDFS
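For example, dumping a local file into HDFS takes a couple of shell commands. This is a sketch: the file name is illustrative, and the paths match the hdfs://localhost:9000/basketball layout used later in this tutorial.

# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /basketball/BasketballStats
hdfs dfs -put leagues_NBA_2016.csv /basketball/BasketballStats/

# Verify the upload
hdfs dfs -ls /basketball/BasketballStats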
Hadoop Features
Economical: usage of commodity hardware minimizes the cost of ownership
Scalability: in-built capability of integrating seamlessly with cloud-based services
Reliability: the Hadoop infrastructure has in-built fault tolerance features
Flexibility: flexible with all kinds of data
Spark vs Hadoop
Use cases for real-time analytics span banking, government, healthcare, telecommunications and the stock market.
Our requirements:
Process data in real time
Handle input from multiple sources
Easy to use
Faster processing
Spark vs Hadoop
Spark runs up to 100x faster than Hadoop. The in-memory processing in Spark is what makes it faster than MapReduce. Spark is not considered a replacement for Hadoop, but rather an extension to it.
Figure: PageRank performance, iteration time in seconds, for Hadoop, basic Spark, and Spark with controlled partitioning
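The "controlled partitioning" series in the chart refers to fixing an RDD's partitioning up front so that iterative joins stop reshuffling data. Below is a minimal PageRank sketch on a tiny synthetic link graph; the graph, partition count and iteration count are all assumptions for illustration.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedPageRank {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank").setMaster("local[2]"))

    //Tiny synthetic link graph: (page, pages it links to)
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
    )).partitionBy(new HashPartitioner(4)) //fix the partitioning once
      .cache()                             //and keep the graph in memory

    var ranks = links.mapValues(_ => 1.0)

    //Because links is hash-partitioned, each join reuses that partitioning
    //instead of reshuffling the full dataset on every iteration
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (neighbours, rank) => neighbours.map((_, rank / neighbours.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach(println)
    sc.stop()
  }
}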
The best case, as the chart shows, is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.
Why Spark Hadoop?
Using Spark and Hadoop together lets us leverage Spark's processing while getting the best out of Hadoop's HDFS and YARN.
Figure: Input data in formats such as CSV, SequenceFile, Avro and Parquet (optionally arriving via Spark Streaming) is stored in HDFS as the storage source; Spark handles the processing, with MapReduce as optional processing, YARN handles resource allocation, and the results are written back as output data. A sketch of reading such formats follows.
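Here is a minimal sketch of pulling two of those formats out of HDFS with the RDD API. The paths are illustrative; Avro and Parquet typically need additional libraries or the DataFrame reader, so they are omitted here.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsFormatsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-formats").setMaster("local[2]"))

    //Plain text / CSV lines read straight from HDFS
    val csvLines = sc.textFile("hdfs://localhost:9000/data/input.csv")
    println(csvLines.count())

    //A Hadoop SequenceFile with String keys and values
    val seqData = sc.sequenceFile[String, String]("hdfs://localhost:9000/data/input.seq")
    println(seqData.count())

    sc.stop()
  }
}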
Using Hadoop with Spark
Spark can be used along with MapReduce in the same Hadoop cluster, or separately as a processing framework.
Spark applications can also be run on YARN (Hadoop NextGen).
MapReduce and Spark are used together where MapReduce handles batch processing and Spark handles real-time processing.
Spark can run on top of HDFS to leverage the distributed, replicated storage.
YARN Deployment With Spark
YARN Cluster Mode: the Spark driver runs inside an application master process which is managed by YARN. The client can go away after initiating the application.
YARN Client Mode: the Spark driver runs in the client process. The application master is only used for requesting resources from YARN.
Figure: Cluster Deployment Mode
Figure: Client Deployment Mode
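Both modes are selected at submission time. A sketch with spark-submit, where the jar name is an assumption (the class matches the basketball object used later) and HADOOP_CONF_DIR is assumed to point at your cluster configuration:

# Cluster mode: the driver runs inside a YARN application master
spark-submit --class basketball --master yarn --deploy-mode cluster basketball.jar

# Client mode: the driver runs in the local client process
spark-submit --class basketball --master yarn --deploy-mode client basketball.jar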
Use Case
Figure: Stephen Curry (NBA MVP 2015 & 2016), Kevin Durant (NBA MVP 2014), LeBron James (NBA MVP '10, '12 & '13), Joe Hassett (highest normalized 3-point score)
Problem Statement
To build a sports analytics system using Spark and Hadoop for predicting game results and player rankings for sports like basketball, football, cricket, soccer, etc.
We will demonstrate this using basketball as our use case.
Use Case – Flow Diagram
1. Huge amounts of sports data
2. Data stored in HDFS
3. Spark processing used for analysis
4. Queries run against the processed data:
Query 1: Calculate top scorers per season
Query 2: Predict the NBA Most Valuable Player (MVP)
Query 3: Compare teams to predict winners
Use Case – Reading Data From HDFS
import org.apache.spark.{SparkConf, SparkContext}

//Create an object basketball containing our main() method
object basketball {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    //Read each season's raw stats from HDFS, tag every row with its year
    //and write the result back to HDFS
    for (i <- 1980 to 2016) {
      println(i)
      val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
      yearStats.filter(x => x.contains(","))
        .map(x => (i, x))
        .saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
    }
  }
}
Use Case – Parsing Data And Broadcasting
//Read in all the statistics written out in the previous step
val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear/*/*")
  .repartition(sc.defaultParallelism)

//Filter out the junk rows and clean up the data for errors
val filteredStats = stats.filter(line => !line.contains("FG%"))
  .filter(line => line.contains(","))
  .map(line => line.replace("*", "").replace(",,", ",0,"))
filteredStats.cache()

//Parse the statistics and aggregate them into a Map
//(processStats is a helper defined elsewhere in the project)
val txtStat = Array("FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "eFG%",
  "FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS")
val aggStats = processStats(filteredStats, txtStat).collectAsMap

//Broadcast the aggregated stats to all executors as 'broadcastStats'
val broadcastStats = sc.broadcast(aggStats)
Use Case – Player Statistics Transformations
import org.apache.spark.sql.Row

//Parse the stats and track weights (z-scores) for the key categories
val txtStatZ = Array("FG", "FT", "3P", "TRB", "AST", "STL", "BLK", "TOV", "PTS")
val zStats = processStats(filteredStats, txtStatZ, broadcastStats.value).collectAsMap

//Collect the RDD into a Map and broadcast it as 'zBroadcastStats'
//(this must exist before it is used in the normalization step below)
val zBroadcastStats = sc.broadcast(zStats)

//Parse the stats and normalize them against the broadcast aggregates
//(bbParse is a helper defined elsewhere in the project)
val nStats = filteredStats.map(x => bbParse(x, broadcastStats.value, zBroadcastStats.value))

//Map the RDD to RDD[Row] so that we can turn it into a DataFrame
val nPlayer = nStats.map(x =>
  Row.fromSeq(Array(x.name, x.year, x.age, x.position, x.team, x.gp, x.gs, x.mp)
    ++ x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
Use Case – Getting All Player Statistics
import org.apache.spark.sql.types._

//Create a schema for the DataFrame
val schemaN = StructType(
  StructField("name", StringType, true) ::
  StructField("year", IntegerType, true) ::
  ... //remaining stat columns elided on the slide
  StructField("nTOT", DoubleType, true) :: Nil)

//Create the DataFrame 'dfPlayersT' and register it as the temp table 'tPlayers'
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfPlayersT = sqlContext.createDataFrame(nPlayer, schemaN)
dfPlayersT.registerTempTable("tPlayers")

//Derive experience (age minus first-season age) per player and register as 'Players'
val dfPlayers = sqlContext.sql("select age-min_age as exp, tPlayers.* from tPlayers join (select name, min(age) as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp")
dfPlayers.registerTempTable("Players")
Use Case – Storing Best Players Into HDFS
//Calculate the best players of 2016
val mvp = sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").cache
mvp.show

//Store the best players of 2016 into HDFS
mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")

//List the full numbers of LeBron James
sqlContext.sql("select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)

//Rank the top 10 players by average 3-pointers scored per game in 2016
//(the `3p` column name starts with a digit, so it needs backticks in Spark SQL)
sqlContext.sql("select name, `3p`, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)
Use Case – Highest 3 Point Shooters
//All-time 3-point shooting ranking by raw average
sqlContext.sql("select name, `3p`, z3p from Players order by `3p` desc").take(10).foreach(println)

//All-time 3-point shooting ranking, normalized to each player's league
sqlContext.sql("select name, `3p`, z3p from Players order by z3p desc").take(10).foreach(println)

//Look up the league-average number of 3-pointers per game in 2016
broadcastStats.value("2016_3P_avg")

//Look up the league-average number of 3-pointers per game in 1981
broadcastStats.value("1981_3P_avg")
Use Case – Who Will Be The 2016 NBA MVP?
Figure: The MVP candidates LeBron James, Stephen Curry, James Harden, Russell Westbrook, Kobe Bryant and Dwyane Wade
sqlContext.sql("select name, zTot from Players where year=2016
order by zTot desc").take(10).foreach(println)
Conclusion
Congrats!
We have demonstrated the power of Spark and Hadoop in predictive analytics. These hands-on examples should give you the confidence to work on any future Apache Spark and Hadoop projects you encounter.