This Edureka Spark Hadoop Tutorial will help you understand how to use Spark and Hadoop together. It is ideal for beginners as well as professionals who want to learn Apache Spark or brush up on their Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
What is Spark?
Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations.
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in processing time)
Figure: Real Time Processing In Spark
Spark Overview
Real-time processing: Spark is used in real-time data processing
In-memory computation: real-time computation and low latency, thanks to in-memory computation
Polyglot: can be programmed in Scala, Java, Python and R
Lazy evaluation: delays evaluation till needed (see the sketch below)
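To make lazy evaluation concrete, here is a minimal sketch in Scala (the input path input.txt is an assumption for illustration): transformations such as filter and map only record the lineage, and nothing runs until an action like reduce is called.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval").setMaster("local[2]"))

    //Transformations are only recorded here -- nothing is computed yet
    val lines = sc.textFile("input.txt") //hypothetical input file
    val nonEmpty = lines.filter(_.nonEmpty)
    val lengths = nonEmpty.map(_.length)

    //The action below triggers evaluation of the whole lineage in one pass
    val total = lengths.reduce(_ + _)
    println(s"Total characters: $total")

    sc.stop()
  }
}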
Spark Ecosystem
Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data (sketched below)
MLlib (Machine Learning): machine learning libraries built on top of Spark
GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
SparkR (R on Spark): package for the R language that enables R users to leverage Spark's power from the R shell
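As a flavour of the ecosystem, here is a minimal Spark Streaming sketch: a word count over text arriving on a socket. The host and port are assumptions; you can feed it locally with `nc -lk 9999`.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    //Process the stream in 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    //Assumed text source on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}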
Spark Features
Speed: runs up to 100x faster than Hadoop MapReduce for large-scale data processing
Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities (see the sketch below)
Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
Polyglot: can be programmed in Scala, Java, Python and R
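A minimal caching sketch (the data here is synthetic): persist keeps an RDD around across actions, and MEMORY_AND_DISK spills to disk when memory is tight.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-demo").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 1000000)

    //Keep the RDD in memory, spilling to disk if it does not fit
    nums.persist(StorageLevel.MEMORY_AND_DISK)

    //Both actions reuse the cached data instead of recomputing the lineage
    println(nums.sum())
    println(nums.max())

    sc.stop()
  }
}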
Spark Use Cases
Twitter Sentiment Analysis with Spark: trending topics can be used to create campaigns and attract a larger audience, while sentiment helps in crisis management, service adjustment and target marketing
NYSE: real-time analysis of stock market data
Banking: credit card fraud detection
Genomic sequencing
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion. A Hadoop cluster consists of a master node and slave nodes, and rests on two components:
HDFS (Storage): allows us to dump any kind of data across the cluster
MapReduce (Processing): allows parallel processing of the data stored in HDFS
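For example, dumping a local file into HDFS takes a couple of shell commands. This is a sketch: the file name is illustrative, and the paths match the hdfs://localhost:9000/basketball layout used later in this tutorial.

# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /basketball/BasketballStats
hdfs dfs -put leagues_NBA_2016.csv /basketball/BasketballStats/

# Verify the upload
hdfs dfs -ls /basketball/BasketballStats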
Hadoop Features
Economical: usage of commodity hardware minimizes the cost of ownership
Scalability: in-built capability of integrating seamlessly with cloud-based services
Reliability: the Hadoop infrastructure has in-built fault tolerance features
Flexibility: flexible with all kinds of data
Spark vs Hadoop
Use cases for real-time analytics span banking, government, healthcare, telecommunications and the stock market.
Our requirements:
Process data in real time
Handle input from multiple sources
Easy to use
Faster processing
Spark vs Hadoop
Spark runs up to 100x faster than Hadoop. The in-memory processing in Spark is what makes it faster than MapReduce. Spark is not considered a replacement for Hadoop, but rather an extension to it.
Figure: PageRank performance, iteration time in seconds, for Hadoop, basic Spark, and Spark with controlled partitioning
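The "controlled partitioning" series in the chart refers to fixing an RDD's partitioning up front so that iterative joins stop reshuffling data. Below is a minimal PageRank sketch on a tiny synthetic link graph; the graph, partition count and iteration count are all assumptions for illustration.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedPageRank {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank").setMaster("local[2]"))

    //Tiny synthetic link graph: (page, pages it links to)
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
    )).partitionBy(new HashPartitioner(4)) //fix the partitioning once
      .cache()                             //and keep the graph in memory

    var ranks = links.mapValues(_ => 1.0)

    //Because links is hash-partitioned, each join reuses that partitioning
    //instead of reshuffling the full dataset on every iteration
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (neighbours, rank) => neighbours.map((_, rank / neighbours.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach(println)
    sc.stop()
  }
}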
The best case, as the chart shows, is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.
Why Spark Hadoop?
Using Spark and Hadoop together lets us leverage Spark's processing while getting the best out of Hadoop's HDFS and YARN.
Figure: Input data in formats such as CSV, SequenceFile, Avro and Parquet (optionally arriving via Spark Streaming) is stored in HDFS as the storage source; Spark handles the processing, with MapReduce as optional processing, YARN handles resource allocation, and the results are written back as output data. A sketch of reading such formats follows.
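Here is a minimal sketch of pulling two of those formats out of HDFS with the RDD API. The paths are illustrative; Avro and Parquet typically need additional libraries or the DataFrame reader, so they are omitted here.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsFormatsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-formats").setMaster("local[2]"))

    //Plain text / CSV lines read straight from HDFS
    val csvLines = sc.textFile("hdfs://localhost:9000/data/input.csv")
    println(csvLines.count())

    //A Hadoop SequenceFile with String keys and values
    val seqData = sc.sequenceFile[String, String]("hdfs://localhost:9000/data/input.seq")
    println(seqData.count())

    sc.stop()
  }
}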
Using Hadoop with Spark
Spark can be used along with MapReduce in the same Hadoop cluster, or separately as a processing framework.
Spark applications can also be run on YARN (Hadoop NextGen).
MapReduce and Spark are used together where MapReduce handles batch processing and Spark handles real-time processing.
Spark can run on top of HDFS to leverage the distributed, replicated storage.
YARN Deployment With Spark
YARN Cluster Mode: the Spark driver runs inside an application master process which is managed by YARN. The client can go away after initiating the application.
YARN Client Mode: the Spark driver runs in the client process. The application master is only used for requesting resources from YARN.
Figure: Cluster Deployment Mode
Figure: Client Deployment Mode
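Both modes are selected at submission time. A sketch with spark-submit, where the jar name is an assumption (the class matches the basketball object used later) and HADOOP_CONF_DIR is assumed to point at your cluster configuration:

# Cluster mode: the driver runs inside a YARN application master
spark-submit --class basketball --master yarn --deploy-mode cluster basketball.jar

# Client mode: the driver runs in the local client process
spark-submit --class basketball --master yarn --deploy-mode client basketball.jar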
Use Case
Figure: Stephen Curry (NBA MVP 2015 & 2016), Kevin Durant (NBA MVP 2014), LeBron James (NBA MVP '10, '12 & '13), Joe Hassett (highest normalized 3-point score)
Problem Statement
To build a sports analytics system using Spark and Hadoop for predicting game results and player rankings for sports like basketball, football, cricket, soccer, etc.
We will demonstrate this using basketball as our use case.
Use Case – Flow Diagram
1. Huge amounts of sports data
2. Data stored in HDFS
3. Spark processing used for analysis
4. Queries run against the processed data:
Query 1: Calculate top scorers per season
Query 2: Predict the NBA Most Valuable Player (MVP)
Query 3: Compare teams to predict winners
Use Case – Reading Data From HDFS
import org.apache.spark.{SparkConf, SparkContext}

//Create an object basketball containing our main() method
object basketball {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    //Read each season's raw stats from HDFS, tag every row with its year
    //and write the result back to HDFS
    for (i <- 1980 to 2016) {
      println(i)
      val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
      yearStats.filter(x => x.contains(","))
        .map(x => (i, x))
        .saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
    }
  }
}
Use Case – Parsing Data And Broadcasting
//Read in all the statistics written out in the previous step
val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear/*/*")
  .repartition(sc.defaultParallelism)

//Filter out the junk rows and clean up the data for errors
val filteredStats = stats.filter(line => !line.contains("FG%"))
  .filter(line => line.contains(","))
  .map(line => line.replace("*", "").replace(",,", ",0,"))
filteredStats.cache()

//Parse the statistics and aggregate them into a Map
//(processStats is a helper defined elsewhere in the project)
val txtStat = Array("FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "eFG%",
  "FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS")
val aggStats = processStats(filteredStats, txtStat).collectAsMap

//Broadcast the aggregated stats to all executors as 'broadcastStats'
val broadcastStats = sc.broadcast(aggStats)
Use Case – Player Statistics Transformations
import org.apache.spark.sql.Row

//Parse the stats and track weights (z-scores) for the key categories
val txtStatZ = Array("FG", "FT", "3P", "TRB", "AST", "STL", "BLK", "TOV", "PTS")
val zStats = processStats(filteredStats, txtStatZ, broadcastStats.value).collectAsMap

//Collect the RDD into a Map and broadcast it as 'zBroadcastStats'
//(this must exist before it is used in the normalization step below)
val zBroadcastStats = sc.broadcast(zStats)

//Parse the stats and normalize them against the broadcast aggregates
//(bbParse is a helper defined elsewhere in the project)
val nStats = filteredStats.map(x => bbParse(x, broadcastStats.value, zBroadcastStats.value))

//Map the RDD to RDD[Row] so that we can turn it into a DataFrame
val nPlayer = nStats.map(x =>
  Row.fromSeq(Array(x.name, x.year, x.age, x.position, x.team, x.gp, x.gs, x.mp)
    ++ x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
Use Case – Getting All Player Statistics
import org.apache.spark.sql.types._

//Create a schema for the DataFrame
val schemaN = StructType(
  StructField("name", StringType, true) ::
  StructField("year", IntegerType, true) ::
  ... //remaining stat columns elided on the slide
  StructField("nTOT", DoubleType, true) :: Nil)

//Create the DataFrame 'dfPlayersT' and register it as the temp table 'tPlayers'
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfPlayersT = sqlContext.createDataFrame(nPlayer, schemaN)
dfPlayersT.registerTempTable("tPlayers")

//Derive experience (age minus first-season age) per player and register as 'Players'
val dfPlayers = sqlContext.sql("select age-min_age as exp, tPlayers.* from tPlayers join (select name, min(age) as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp")
dfPlayers.registerTempTable("Players")
Use Case – Storing Best Players Into HDFS
//Calculate the best players of 2016
val mvp = sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").cache
mvp.show

//Store the best players of 2016 into HDFS
mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")

//List the full numbers of LeBron James
sqlContext.sql("select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)

//Rank the top 10 players by average 3-pointers scored per game in 2016
//(the `3p` column name starts with a digit, so it needs backticks in Spark SQL)
sqlContext.sql("select name, `3p`, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)
Use Case – Highest 3 Point Shooters
//All-time 3-point shooting ranking by raw average
sqlContext.sql("select name, `3p`, z3p from Players order by `3p` desc").take(10).foreach(println)

//All-time 3-point shooting ranking, normalized to each player's league
sqlContext.sql("select name, `3p`, z3p from Players order by z3p desc").take(10).foreach(println)

//Look up the league-average number of 3-pointers per game in 2016
broadcastStats.value("2016_3P_avg")

//Look up the league-average number of 3-pointers per game in 1981
broadcastStats.value("1981_3P_avg")
Use Case – Who Will Be The 2016 NBA MVP?
Figure: The MVP candidates LeBron James, Stephen Curry, James Harden, Russell Westbrook, Kobe Bryant and Dwyane Wade
sqlContext.sql("select name, zTot from Players where year=2016
order by zTot desc").take(10).foreach(println)
Conclusion
Congrats!
We have demonstrated the power of Spark and Hadoop in predictive analytics. These hands-on examples should give you the confidence to work on any future Apache Spark and Hadoop projects you encounter.