2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
7. Agenda
What is Spark?
Spark Infrastructure and Basics
Spark Features and Suite
Spark-Shell Live Demo
Windward Use Cases
◦ Ask the Hard Questions
Conclusion
8. What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
Efficient
◦ General execution graphs
◦ In-memory storage
Usable
◦ Rich APIs in Java, Scala, Python
◦ Interactive shell
9. What is Spark?
Apache Spark is a general-purpose cluster computing framework
Spark does its computation in memory
Spark is also fast for heavy operations that run on disk
13. Spark Philosophy
Make life easy and productive for data
scientists
Well-documented, expressive APIs
Powerful domain-specific libraries
Easy integration with storage systems
… and caching to avoid data movement
Predictable releases, stable APIs
14. About Spark project
Spark comes from UC Berkeley and several
companies
Interactive Spark shell in Scala and Python
Currently stable at version 1.2
17. Driver and Spark Context
The Spark Context is your connection to the Spark cluster
The driver program contains the main method
◦ It defines the RDDs
◦ It applies operations to them
The driver uses the Spark Context to access your cluster
In the Spark Shell, a variable named sc (the Spark Context) is already defined in your driver
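For example, a minimal sketch of working with sc in the Spark Shell (the file path and its contents are made up):

// sc is already defined for you in the Spark Shell.
// Load a text file into an RDD (hypothetical path).
val lines = sc.textFile("data/ships.txt")
// The driver applies an operation to the RDD and gets back a result.
val count = lines.count()
println(s"Number of lines: $count")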
18. Resilient Distributed Datasets
RDDs are fault tolerant
Parallel data structures
RDDs are immutable
Can persist intermediate results in memory
Transformations are operators and are lazily evaluated
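A short sketch (in the Spark Shell, with made-up data) of how transformations return new, immutable RDDs and stay lazy until an action runs:

// Build an RDD from a local collection (made-up data).
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
// map is a transformation: it returns a new RDD and is lazily evaluated.
val doubled = nums.map(_ * 2)
// nums is unchanged (RDDs are immutable), and nothing has executed yet.
// The action below is what actually triggers the computation.
doubled.collect() // Array(2, 4, 6, 8, 10)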
21. RDD Persistence and partitioning
Users control which RDDs will be reused (in memory and on disk)
◦ Persist, Cache, Unpersist
Users can ask for an RDD to be partitioned across machines
Only the lost partitions of an RDD need to be recomputed upon failure
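A hedged sketch of the persistence API (the file path and the choice of storage level are just examples):

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("data/events.log") // hypothetical path
// Keep this RDD around for reuse; MEMORY_AND_DISK spills to disk if needed.
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
events.persist(StorageLevel.MEMORY_AND_DISK)
events.count() // first action computes the RDD and stores its partitions
events.count() // second action reuses the stored partitions
events.unpersist() // release the storage when done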
22. Spark execution engine
Spark uses lazy evaluation
◦ Runs the code only when it encounters an action operation
There is no need to design and write a single, complex map-reduce job
In Spark we can write smaller, more manageable operations
◦ Spark will group operations together
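For example, a sketch of several small operations that Spark groups into a single pass over the data (file path and log format are made up):

// Each step is a separate, small transformation.
val raw = sc.textFile("data/access.log")
val trimmed = raw.map(_.trim)
val errors = trimmed.filter(_.contains("ERROR"))
val fields = errors.map(_.split(" ")(0))
// Nothing has executed so far; the action below makes Spark group the
// transformations above and run them together.
val firstTen = fields.take(10)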
23. Spark execution engine
Serializes your code and ships it to the executors
In Java, functions are specified as objects that implement one of Spark’s Function interfaces
24. Persistence layers for Spark
Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
File formats
◦ Text file
◦ Sequence File
◦ Avro
◦ Parquet
◦ Other Hadoop formats
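For illustration, a hedged sketch of loading data from a few of these layers (all paths and bucket names are made up):

// Local file system
val local = sc.textFile("file:///tmp/input.txt")
// HDFS
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.txt")
// Amazon S3 (hypothetical bucket)
val fromS3 = sc.textFile("s3n://my-bucket/input.txt")
// Hadoop SequenceFile of (Text, IntWritable) pairs
val fromSeq = sc.sequenceFile[String, Int]("hdfs://namenode:8020/data/pairs")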
27. Spark Core Features
Distributed in-memory computation
Standalone and local capabilities
History server for the Spark UI
Resource management integration
Unified job submission tool
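The unified submission tool is spark-submit; a sketch of a typical invocation (the class and jar names are hypothetical):

# Master could also be local[*], spark://host:7077, or mesos://host:5050.
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --executor-memory 2G \
  my-app.jar arg1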
28. History Server
Standalone cluster
Integrates with both YARN and Mesos
In Spark Standalone, the history server is embedded in the master
On YARN/Mesos, run the history server as a daemon
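A hedged sketch of the configuration involved (the log directory is an example):

# conf/spark-defaults.conf: make applications log their events
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs://namenode:8020/spark-logs
# point the history server at the same directory
spark.history.fs.logDirectory   hdfs://namenode:8020/spark-logs

# On YARN/Mesos, start the daemon yourself:
./sbin/start-history-server.sh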
31. Java 8 Support
RDD operations can use lambda syntax

// Old: a function as an object implementing one of Spark’s Function interfaces
class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}
JavaRDD<String> words = lines.flatMap(new Split());

// New: the same operation with a Java 8 lambda
JavaRDD<String> words = lines
  .flatMap(s -> Arrays.asList(s.split(" ")));
36. Turning an RDD into a Relation
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
37. Querying using SQL
// SQL statements can be run directly on RDDs
val teenagers =
  sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ)
val teenagers =
  people.where('age >= 10).where('age <= 19).select('name)
38. Import and Export
// Save SchemaRDDs directly to Parquet
people.saveAsParquetFile("people.parquet")

// Load data stored in Hive
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")
39. Spark Streaming
Web UI for streaming
Graceful shutdown
User-defined input streams
Support for creating input streams in Java
Refactored API
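A minimal sketch of a streaming word count on top of the existing SparkContext (the host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// One-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))
// A text stream read from a socket (placeholder host/port).
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()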
43. Windward Spark App Architecture
Written in Java
Running on a Yarn cluster (AWS Elastic Map Reduce)
Each App is tested via:
◦ Unit tests, End2End tests, Functional tests
Automation via Jenkins CI to 4 different environments
◦ dev, testing, staging, production
◦ Run functional testing (on the cluster itself)
Single configuration file for all apps, all environments
Each app has monitors running periodically and per run; metrics are reported to Graphite
Scheduling is currently done via a cronjob on the yarn-master
47. Conclusion
Spark is a popular and very powerful distributed in-memory computation framework
Broadly used, with lots of contributors
A leading tool in the new world of petabytes of unexplored data