1. © 2015 IBM Corporation
Introduction to Apache Spark
Vincent Poncet
IBM Software Big Data Technical Sale
02/07/2015
2. 2 © 2015 IBM Corporation
Credits
This presentation draws upon previous work / slides by IBM
colleagues from WW Software Big Data Organization : Daniel
Kikuchi, Jacques Roy and Mokhtar Kandil
I also used several materials from DataBricks and the Apache Spark
documentation
3. 3 © 2015 IBM Corporation
Introduction and background
Spark Core API
Spark Execution Model
Spark Shell & Application Deployment
Spark Extensions (SparkSQL, MLlib, Spark Streaming)
Spark Future
Agenda
4. 4 © 2015 IBM Corporation
Introduction and background
5. 5 © 2015 IBM Corporation
Apache Spark is a fast, general purpose,
easy-to-use cluster computing system for
large-scale data processing
– Fast
• Leverages aggressively cached, in-memory distributed computing and dedicated Executor processes that stay up even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex
analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object oriented,
functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud
Logistic regression in Hadoop and Spark
Spark Stack
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
WordCount
6. 6 © 2015 IBM Corporation
Brief History of Spark
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 release in December
2015 – 1.3.0 release in March
2015 – 1.4.0 release in June
Spark is HOT!!!
Most active project in Hadoop
ecosystem
One of top 3 most active Apache
projects
Databricks founded by the creators
of Spark from UC Berkeley’s
AMPLab
Activity for 6 months in 2014
(from Matei Zaharia – 2014 Spark Summit)
DataBricks
In June 2015, the code base was about 400K lines
7. 7 © 2015 IBM Corporation
DataBricks / Spark Summit 2015
8. 8 © 2015 IBM Corporation
Large Scale Usage
DataBricks / Spark Summit 2015
9. 9 © 2015 IBM Corporation
Spark ecosystem
Spark is quite versatile and flexible:
– Can run on YARN / HDFS but also standalone or on MESOS
– The general processing capabilities of the Spark engine can be exploited from
multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
10. 10 © 2015 IBM Corporation
Spark in the Hadoop ecosystem
Currently, Spark is a general purpose parallel processing engine
which integrates with YARN alongside the rest of the Hadoop frameworks
(Diagram: YARN on HDFS, running MapReduce 2, Hive, Pig and Spark; HBase, BigSQL and Impala alongside.)
11. 11 © 2015 IBM Corporation
Future of Spark’s role in Hadoop ?
The Spark Core engine is a good, performant replacement for MapReduce:
(Diagram: YARN on HDFS, with Spark Core underpinning Spark SQL, Spark MLlib, Spark Streaming and custom code; Hive, BigSQL and HBase alongside.)
13. 13 © 2015 IBM Corporation
An RDD is a distributed collection of Scala/Python/Java objects of
the same type:
– RDD of strings
– RDD of integers
– RDD of (key, value) pairs
– RDD of Scala/Python/Java class instances (objects)
An RDD is physically distributed across the cluster, but manipulated
as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD
exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
Resilient Distributed Dataset (RDD): definition
Names RDD – Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
14. 14 © 2015 IBM Corporation
Suppose we want to know the number of names in the RDD “Names”
User simply requests: Names.count()
– Spark will “distribute” count processing to all partitions so as to obtain:
• Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
• Partition 2: Cindy (1), Dan (1), Susan (1) → 3
• Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
– Local counts are subsequently aggregated: 3 + 3 + 3 = 9
To lookup the first element in the RDD: Names.first()
To display all elements of the RDD: Names.collect() (careful with this)
Resilient Distributed Dataset: definition
Names RDD – Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
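A minimal spark-shell sketch (not on the original slide) that builds such a Names RDD with three partitions and runs the actions above:
val names = sc.parallelize(Seq("Mokhtar", "Jacques", "Dirk", "Cindy", "Dan", "Susan", "Dirk", "Frank", "Jacques"), 3) // 3 partitions
names.count()   // 9
names.first()   // "Mokhtar"
names.collect() // Array of all 9 names, gathered on the driver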
15. 15 © 2015 IBM Corporation
Resilient Distributed Datasets: Creation and Manipulation
Three methods for creation
– Distributing a collection of objects from the driver program (using the
parallelize method of the spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize (List(“a”, “b”, “c”, “d”))
– Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
– Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x=> x+1)
Dataset from any storage supported by Hadoop
– HDFS, Cassandra, HBase, Amazon S3
– Others
File types supported
– Text files, SequenceFiles, Parquet, JSON
– Hadoop InputFormat
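A hedged sketch of a few of these input methods (the HDFS paths are hypothetical); textFile, wholeTextFiles and sequenceFile are standard SparkContext methods:
import org.apache.hadoop.io.{IntWritable, Text}
val logs    = sc.textFile("hdfs:/sparkdata/logs/*.log")                                    // one String per line
val docs    = sc.wholeTextFiles("hdfs:/sparkdata/docs")                                    // (fileName, fileContent) pairs
val seqData = sc.sequenceFile("hdfs:/sparkdata/seq", classOf[IntWritable], classOf[Text])  // (K, V) pairs from a SequenceFile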
16. 16 © 2015 IBM Corporation
Resilient Distributed Datasets: Properties
Immutable
Two types of operations
– Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place → lazy evaluation
– Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the transformations and the action
• Returns a value (or writes to a file)
Fault tolerance
– If data in memory is lost it will be recreated from lineage
Caching, persistence (memory, spilling, disk) and check-pointing
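A small sketch of these properties in the shell (nothing runs until the first action):
val rddNumbers  = sc.parallelize(1 to 10)     // transformation: lineage only, no execution
val rddNumbers2 = rddNumbers.map(x => x + 1)  // lineage (DAG) extended, still lazy
rddNumbers2.cache()                           // mark for in-memory persistence (also lazy)
rddNumbers2.collect()                         // action: triggers execution, returns Array(2, ..., 11)
rddNumbers2.count()                           // reuses the cached partitions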
17. 17 © 2015 IBM Corporation
RDD Transformations
Transformations are lazily evaluated
A transformation returns a pointer to the transformed RDD
Pair RDD (K,V) functions for MapReduce style transformations
Transformation Meaning
map(func) Return a new dataset formed by passing each element of the source through a function func.
filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
combineByKey[C](createCombiner, mergeValue, mergeCombiners) Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
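A hedged pair-RDD sketch of some of these transformations (the tiny datasets are made up for the example):
val sales  = sc.parallelize(Seq(("fr", 10), ("us", 25), ("fr", 5)))
val labels = sc.parallelize(Seq(("fr", "France"), ("us", "United States")))
val totals = sales.reduceByKey((a, b) => a + b)                  // ("fr", 15), ("us", 25)
val sorted = totals.sortByKey()                                  // sorted by country code
val joined = sorted.join(labels)                                 // ("fr", (15, "France")), ("us", (25, "United States"))
val big    = joined.filter { case (_, (total, _)) => total > 20 }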
18. 18 © 2015 IBM Corporation
RDD Actions
Actions return a value to the driver program or save an RDD to storage
Action Meaning
collect() Return all the elements of the dataset as an array to the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data.
count() Return the number of elements in a dataset.
first() Return the first element of the dataset
take(n) Return an array with the first n elements of the dataset.
foreach(func) Run a function func on each element of the dataset.
saveAsTextFile(path) Save the RDD as a text file at the given path.
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
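Continuing the hypothetical pair-RDD example from the previous slide, the actions could be exercised like this:
joined.count()                                       // 2
joined.take(1)                                       // Array with the first element
joined.collect().foreach(println)                    // careful: brings the whole RDD back to the driver
joined.saveAsTextFile("hdfs:/sparkdata/out/joined")  // hypothetical output path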
19. 19 © 2015 IBM Corporation
RDD Persistence
Each node stores in memory any partitions of the RDD that it computes
Reuses them in other actions on that dataset (or datasets derived from it)
– Future actions are much faster (often by more than 10x)
Two methods for RDD persistence: persist() and cache()
Storage Level Meaning
MEMORY_ONLY Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, only part of it will be cached; the rest will be recomputed as needed. This is the default, and the level the cache() method uses.
MEMORY_AND_DISK Same except also store on disk if it doesn’t fit in memory. Read from memory and disk
when needed.
MEMORY_ONLY_SER Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER Similar to MEMORY_AND_DISK but stored as serialized objects.
DISK_ONLY Store only on disk.
MEMORY_ONLY_2,
MEMORY_AND_DISK_2, etc.
Same as above, but replicate each partition on two cluster nodes
OFF_HEAP (experimental) Store RDD in serialized format in Tachyon.
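A hedged persistence sketch; StorageLevel comes from org.apache.spark.storage:
import org.apache.spark.storage.StorageLevel
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
quotes.cache()                                 // shorthand for persist(StorageLevel.MEMORY_ONLY)
quotes.count()                                 // first action materializes and caches the partitions
quotes.filter(_.contains("Spark")).count()     // reuses the cached data
val big = sc.textFile("hdfs:/sparkdata/bigfile.txt")   // hypothetical large input
big.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized, spilling to disk if needed
big.unpersist()                                // drop it from the cache when done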
20. 20 © 2015 IBM Corporation
Scala
Scala Crash Course
Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course
.pdf
21. 21 © 2015 IBM Corporation
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
‘spark-shell’ provides Spark context as ‘sc’
22. 22 © 2015 IBM Corporation
Code Execution (2)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
23. 23 © 2015 IBM Corporation
Code Execution (3)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
24. 24 © 2015 IBM Corporation
Code Execution (4)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
Spark
Scala
25. 25 © 2015 IBM Corporation
Code Execution (5)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
HadoopRDD
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
RDD: quotes
DAN Spark is cool
DAN Scala is awesome
RDD: danQuotes
Spark
Scala
RDD: danSpark
count() result: 1
26. 26 © 2015 IBM Corporation
DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is
conceptually equivalent to a table in a relational database, an R data frame or a Python pandas
DataFrame, but distributed and with query optimizations and predicate pushdown to the
underlying storage.
DataFrames can be constructed from a wide array of sources such as: structured data files,
tables in Hive, external databases, or existing RDDs.
Released in Spark 1.3
DataBricks / Spark Summit 2015
27. 27 © 2015 IBM Corporation
DataFrames Examples
// Create the DataFrame
val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
28. 28 © 2015 IBM Corporation
Spark
Execution Model
29. 29 © 2015 IBM Corporation
Components (diagram, from DataBricks)
Your program (driver): sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); ...
Spark client (app master): holds the RDD graph, the scheduler, the block tracker and the shuffle tracker
Cluster manager: allocates resources for the application
Spark worker: runs task threads and a block manager, on top of HDFS, HBase, …
30. 30 © 2015 IBM Corporation
Scheduling Process (diagram, from DataBricks)
Example job: rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build the operator DAG
DAGScheduler: splits the DAG into stages of tasks and submits each stage as it is ready (agnostic to operators)
TaskScheduler: launches each TaskSet via the cluster manager and retries failed or straggling tasks (doesn't know about stages); a failed stage is resubmitted to the DAGScheduler
Worker: executes tasks in threads and stores/serves blocks through its block manager
31. 31 © 2015 IBM Corporation
Scheduler Optimizations (DataBricks)
Pipelines narrow operations within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data
(Diagram: RDDs A–G combined with map, union, groupBy and join, split into Stages 1–3; shaded boxes mark previously computed partitions.)
32. 32 © 2015 IBM Corporation
Directed Acyclic Graph (DAG)
View the lineage
The whole chain can also be written as a single continuous expression (see below)
scala> danSpark.toDebugString
res1: String =
(2) MappedRDD[4] at map at <console>:16
| MappedRDD[3] at map at <console>:16
| FilteredRDD[2] at filter at <console>:14
| hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12
| hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12
val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
  filter(_.startsWith("DAN")).
  map(_.split(" ")).
  map(x => x(1)).
  filter(_.contains("Spark"))
danSpark.count()
33. 33 © 2015 IBM Corporation
Showing Multiple Apps
(Diagram: a Driver Program holding the SparkContext talks to the Cluster Manager; each Worker Node runs an Executor with its own cache, executing the app's tasks.)
Each Spark application runs as a set of processes coordinated by the
Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos/Yarn)
– Spark context acquires executors (JVM instance)
on worker nodes
– Spark context sends tasks to the executors
DataBricks
34. 34 © 2015 IBM Corporation
Spark Terminology
Context (Connection):
– Represents a connection to the Spark cluster. The application which initiated
the context can submit one or several jobs, sequentially or in parallel, in batch or
interactively, or as a long-running server continuously serving requests.
Driver (Coordinator agent)
– The program or process running the Spark context. Responsible for running
jobs over the cluster and converting the App into a set of tasks
Job (Query / Query plan):
– A piece of logic (code) which will take some input from HDFS (or the local
filesystem), perform some computations (transformations and actions) and
write some output back.
Stage (Subplan)
– Jobs are divided into stages
Tasks (Sub section)
– Each stage is made up of tasks. One task per partition. One task is executed
on one partition (of data) by one executor
Executor (Sub agent)
– The process responsible for executing a task on a worker node
Resilient Distributed Dataset
35. 35 © 2015 IBM Corporation
Spark
Shell & Application Deployment
36. 36 © 2015 IBM Corporation
Spark’s Scala and Python Shell
Spark comes with two shells
– Scala
– Python
APIs available for Scala, Python and Java
Appropriate versions for each Spark release
Spark’s native language is Scala, so it is more natural to write Spark
applications using Scala.
This presentation will focus on code examples in Scala
37. 37 © 2015 IBM Corporation
Spark’s Scala and Python Shell
Powerful tool to analyze data interactively
The Scala shell runs on the Java VM
– Can leverage existing Java libraries
Scala:
– To launch the Scala shell (from Spark home directory):
./bin/spark-shell
– To read in a text file:
scala> val textFile = sc.textFile("README.txt")
Python:
– To launch the Python shell (from Spark home directory):
./bin/pyspark
– To read in a text file:
>>> textFile = sc.textFile("README.txt")
38. 38 © 2015 IBM Corporation
SparkContext in Applications
The main entry point for Spark functionality
Represents the connection to a Spark cluster
Create RDDs, accumulators, and broadcast variables on that
cluster
In the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
In a Spark program, import some classes and implicit conversions
into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
39. 39 © 2015 IBM Corporation
A Spark Standalone Application in Scala
Import statements
SparkConf and
SparkContext
Transformations
and Actions
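The code on this slide is an image in the original deck; a minimal SimpleApp.scala along the same lines (the input path is an assumption) could look like:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Transformations and Actions
    val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").cache()
    val numSpark = quotes.filter(line => line.contains("Spark")).count()
    println("Lines with Spark: " + numSpark)
    sc.stop()
  }
}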
40. 40 © 2015 IBM Corporation
Running Standalone Applications
Define the dependencies
– Scala simple.sbt
Create the typical directory structure with the files
Create a JAR package containing the application’s code.
– Scala: sbt package
Use spark-submit to run the program
Scala:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
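A hedged simple.sbt and submit sequence, assuming Spark 1.4.0 built for Scala 2.10 (the jar name follows from the project name):
// simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

sbt package
./bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar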
41. 41 © 2015 IBM Corporation
Spark Properties
Set application properties via the SparkConf object
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values during runtime
./bin/spark-submit --name "My app" --master local[4] \
  --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar
– conf/spark-defaults.conf (see the sketch below)
Application web UI
http://<driver>:4040
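A sketch of a conf/spark-defaults.conf (values are illustrative; the format is whitespace-separated key/value pairs):
# conf/spark-defaults.conf
spark.master             spark://master:7077
spark.executor.memory    1g
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled   true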
42. 42 © 2015 IBM Corporation
Spark Configuration
Three locations for configuration:
– Spark properties
– Environment variables
conf/spark-env.sh
– Logging
log4j.properties
Override default configuration directory (SPARK_HOME/conf)
– SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
43. 43 © 2015 IBM Corporation
Spark Monitoring
Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Report to a variety of sinks (HTTP, JMX, and CSV)
• /conf/metrics.properties
3. External instrumentations
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
44. 44 © 2015 IBM Corporation
Running Spark Examples
Spark samples available in the examples directory
Run the examples (from Spark home directory):
./bin/run-example SparkPi
where SparkPi is the name of the sample application
45. 45 © 2015 IBM Corporation
Spark Extensions
46. 46 © 2015 IBM Corporation
Spark Extensions
Extensions to the core Spark API
Improvements made to the core are passed to these libraries
Little overhead to use with the Spark core
from http://spark.apache.org
47. 47 © 2015 IBM Corporation
Spark SQL
Process relational queries expressed in SQL (HiveQL)
Seamlessly mix SQL queries with Spark programs
In Spark since 1.0, refactored on top of DataFrames since 1.3
Provides a single interface for efficiently working with structured
data including Apache Hive, Parquet and JSON files
Leverages Hive frontend and metastore
– Compatibility with Hive data, queries
and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization,
automatic memory management or
sophisticated workload management
Graduated from alpha status with Spark 1.3
Standard connectivity through JDBC/ODBC
48. 48 © 2015 IBM Corporation
Spark SQL - Getting Started
SQLContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Import a library to convert an RDD to a DataFrame
– Scala:
import sqlContext.implicits._
DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
49. 49 © 2015 IBM Corporation
Spark SQL - Inferring the Schema Using Reflection
The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
The arguments of the case class become the names of the columns
Create the RDD of the Person object and create a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
map(_.split(",")).
map(p => Person(p(0), p(1).trim.toInt)).toDF()
Register the DataFrame as a table
people.registerTempTable("people")
Run SQL statements using the sql method provided by the
SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
age <= 19")
The results of the queries are DataFrames and support all the normal
RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
50. 50 © 2015 IBM Corporation
Spark SQL - Programmatic Interface
Use when you cannot define the case classes ahead of time
Three steps to create the Dataframe
1. Schema encoded as a String, import SparkSQL Struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
2. Create the schema, represented by a StructType built from the schema string of
step 1.
val schema = StructType( schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
3. Apply the schema to the RDD of Rows using the createDataFrame method.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
Then register the peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
Run the sql statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
51. 51 © 2015 IBM Corporation
SparkSQL - DataSources
Before: Spark 1.2.x
ParquetFile
– val parquetFile = sqlContext.parquetFile("people.parquet")
JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
Generic Load/Save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
ParquetFile
– val df = sqlContext.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").save("namesAndAges.parquet")
JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
Generic Load/Save
– val df = sqlContext.read.format("<datasource type>").load("<filename>")
– df.write.format("<datasource type>").save("<filename>")
ParquetFile
– val df = sqlContext.read.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").write.save("namesAndAges.parquet")
JSON
– val df = sqlContext.read.format("json").load("people.json")
– df.select("name", "age").write.format("json").save("namesAndAges.json")
CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
The DataSource API provides generic methods to manage connectors to any data source (file, JDBC, Cassandra, MongoDB, etc.), from Spark 1.3.
The DataSource API also provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at
http://spark-packages.org/
52. 52 © 2015 IBM Corporation
Spark Streaming
Scalable, high-throughput, fault-tolerant stream processing of live
data streams
Write Spark streaming applications like Spark applications
Recovers lost work and operator state (sliding windows) out of the box
Uses HDFS and Zookeeper for high availability
Data sources also include TCP sockets, ZeroMQ or other customized
data sources
53. 53 © 2015 IBM Corporation
Spark Streaming - Internals
The input stream goes into Spark Streaming
It is broken up into batches of input data
The batches are fed into the Spark engine for processing
The final results are generated as streams of batches
DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
54. 54 © 2015 IBM Corporation
Spark Streaming - Getting Started
Count the number of words coming in from the TCP socket
Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
Create the StreamingContext object
val conf =
new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
Split the lines into words
val words = lines.flatMap(_.split(" "))
Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
Print to the console
wordCounts.print()
55. 55 © 2015 IBM Corporation
Spark Streaming - Continued
No real processing happens until you tell it
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
Code and application can be found in the NetworkWordCount
example
To run the example:
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
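For reference, the stream is typically fed with netcat in one terminal while the example runs in another (as in the Spark documentation):
# terminal 1: start the data stream
nc -lk 9999
# terminal 2: run the application
./bin/run-example streaming.NetworkWordCount localhost 9999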
56. 56 © 2015 IBM Corporation
Spark MLlib
Spark MLlib is Spark's machine learning library
Since Spark 0.8
Provides common algorithms and
utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
Leverages the in-memory cache of Spark to speed up iterative processing
57. 57 © 2015 IBM Corporation
Spark MLlib - Getting Started
Use k-means clustering for set of latitudes and longitudes
Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
Create Vectors for input to algorithm
val taxi =
taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
Print to the console
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
58. 58 © 2015 IBM Corporation
SparkML
SparkML provides an API to build ML pipelines (since Spark 1.3)
Similar to Python scikit-learn
SparkML provides abstraction for all steps of an ML workflow
(Diagrams: Generic ML Workflow vs. Real-Life ML Workflow)
Transformer: A Transformer is an algorithm which can transform
one DataFrame into another DataFrame. E.g., an ML model is a
Transformer which transforms an RDD with features into an
RDD with predictions.
Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm
is an Estimator which trains on a dataset and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
Param: All Transformers and Estimators now share a common
API for specifying parameters. Xebia HUG France 06/2015
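A hedged spark.ml pipeline sketch in the spirit of this slide (the column names and tiny training DataFrame are made up; Tokenizer, HashingTF and LogisticRegression are standard spark.ml components):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training data: (id, text, label)
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark is cool", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers and an Estimator chained into a Pipeline (itself an Estimator)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() produces a PipelineModel (a Transformer) that can score new DataFrames
val model = pipeline.fit(training)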
59. 59 © 2015 IBM Corporation
Spark GraphX
Flexible Graphing
–GraphX unifies ETL, exploratory analysis, and iterative graph
computation
–You can view the same data as both graphs and collections,
transform and join graphs with RDDs efficiently, and write custom
iterative graph algorithms with the API
Speed
–Comparable performance to the fastest specialized graph
processing systems.
Algorithms
–Choose from a growing library of graph algorithms
–In addition to a highly flexible API, GraphX comes
with a variety of graph algorithms
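A hedged GraphX sketch (the tiny vertex and edge data is made up; Graph, Edge and pageRank are standard GraphX APIs):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical tiny graph: users as vertices, "follows" relations as edges
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Dan"), (2L, "Jacques"), (3L, "Mokhtar")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// PageRank, one of the built-in graph algorithms
val ranks = graph.pageRank(0.0001).vertices
val ranksByUser = users.join(ranks).map { case (_, (name, rank)) => (name, rank) }
ranksByUser.collect().foreach(println)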
60. 60 © 2015 IBM Corporation
Spark R
Spark R is an R package that provides a light-weight front-end to use
Apache Spark from R
Spark R exposes the Spark API through the RDD class and allows
users to interactively run jobs from the R shell on a cluster.
Goal
– Make Spark R production ready
– Integration with MLlib
– Consolidation of the DataFrame and RDD concepts
First release in Spark 1.4.0 :
– Support of DataFrames
Spark 1.5
– Support of MLlib
61. 61 © 2015 IBM Corporation
Spark internals refactoring : Project Tungsten
Memory Management and Binary Processing:
leverage application semantics to manage memory
explicitly and eliminate the overhead of JVM object
model and garbage collection
Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
Code generation: exploit modern compilers and
CPUs: allow efficient operation directly on binary data
DataBricks / Spark Summit 2015
62. 62 © 2015 IBM Corporation
Spark: Final Thoughts
Spark is a good replacement for MapReduce
– Higher performance
– The framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrames concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
This is a very fast-paced environment, so keep up!
– Lots of new features in each release (a major release every 3 months)
– Spark currently has the latest / best offering, but things may change again
63. 63 © 2015 IBM Corporation
Resources
The Learning Spark O’Reilly book
Lab(s) this afternoon
The following course on Big Data University