Spark Camp @ Strata CA
Intro to Apache Spark with Hands-on Tutorials

Wed Feb 18, 2015 9:00am–5:00pm
download slides:

training.databricks.com/workshop/sparkcamp.pdf
Licensed under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International License
Tutorial Outline:

morning:
• Welcome + Getting Started
• Ex 1: Pre-Flight Check
• How Spark Runs on a Cluster
• A Brief History
• Ex 3: WC, Joins
• DBC Essentials
• How to “Think Notebooks”
• Ex 4: Workflow exercise
• Tour of Spark API
• Spark @ Strata

afternoon:
• Intro to MLlib
• Songs Demo
• Spark SQL
• Visualizations
• Ex 7: SQL + Visualizations
• Spark Streaming
• Twitter Streaming / Kinesis Demo
• Building a Scala JAR
• Deploying Apps
• Ex 8: GraphX examples
• Case Studies
• Further Resources / Q&A
Introducing:
Reza Zadeh

@Reza_Zadeh
Hossein Falaki

@mhfalaki
Andy Konwinski

@andykonwinski
Chris Fregly

@cfregly
Denny Lee

@dennylee
Holden Karau

@holdenkarau
Krishna Sankar

@ksankar
Paco Nathan

@pacoid
Welcome + 

Getting Started
Everyone will receive a username/password for one 

of the Databricks Cloud shards. Use your laptop and
browser to log in there.

We find that cloud-based notebooks are a simple way
to get started using Apache Spark – as the motto
“Making Big Data Simple” states.	

Please create and run a variety of notebooks on your
account throughout the tutorial. These accounts will
remain open long enough for you to export your work.	

See the product page or FAQ for more details, or
contact Databricks to register for a trial account.
5
Getting Started: Step 1
6
Getting Started: Step 1 – Credentials
url:     https://class01.cloud.databricks.com/
user:    student-777
pass:    93ac11xq23z5150
cluster: student-777
7
Open in a browser window, then click on the
navigation menu in the top/left corner:
Getting Started: Step 2
8
The next columns to the right show folders;

scroll down and click on databricks_guide
Getting Started: Step 3
9
Scroll to open the 01 Quick Start notebook, then
follow the discussion about using key features:
Getting Started: Step 4
10
See /databricks-guide/01 Quick Start 

Key Features:	

• Workspace / Folder / Notebook	

• Code Cells, run/edit/move/comment	

• Markdown	

• Results	

• Import/Export
Getting Started: Step 5
11
Click on the Workspace menu and create your 

own folder (pick a name):
Getting Started: Step 6
12
Navigate to /_SparkCamp/00.pre-flight-check,

then hover over its drop-down menu on the right side:
Getting Started: Step 7
13
Then create a clone of this notebook in the folder
that you just created:
Getting Started: Step 8
14
Attach your cluster – same as your username:
Getting Started: Step 9
15
Now let’s get started with the coding exercise!	

We’ll define an initial Spark app in three lines 

of code:
Getting Started: Coding Exercise
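As a teaser, here is a minimal three-line sketch in Scala (illustrative only, not necessarily the notebook’s exact code):

val data = sc.parallelize(1 to 10000)   // distribute a local collection
val evens = data.filter(_ % 2 == 0)     // a lazy transformation
evens.count()                           // an action that triggers the job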
If you’re new to this Scala thing and want to
spend a few minutes on the basics…
16
Scala Crash Course

Holden Karau

lintool.github.io/SparkTutorial/
slides/day1_Scala_crash_course.pdf
Getting Started: Bonus!
17
Getting Started: Extra Bonus!!
See also the /learning_spark_book 	

for all of its code examples in notebooks:
How Spark runs 

on a Cluster
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
19
Clone and run /_SparkCamp/01.log_example

in your folder:
Spark Deconstructed: Log Mining Example
Spark Deconstructed: Log Mining Example
20
# load error messages from a log into memory
# then interactively search for patterns

# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
x = messages.filter(lambda x: x.find("mysql") > -1)
x.toDebugString()

(2) PythonRDD[772] at RDD at PythonRDD.scala:43 []
 |  PythonRDD[219] at RDD at PythonRDD.scala:43 []
 |  error_log.txt MappedRDD[218] at NativeMethodAccessorImpl.java:-2 []
 |  error_log.txt HadoopRDD[217] at NativeMethodAccessorImpl.java:-2 []
Note that we can examine the operator graph
for a transformed RDD, for example:
21
Driver
Worker
Worker
Worker
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster…

submitting code to be evaluated on it:
22
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
discussing the other part
23
Driver
Worker
Worker
Worker
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
24
Driver
Worker
Worker
Worker
block 1
block 2
block 3
discussing the other part
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
discussing the other part
25
Driver
Worker
Worker
Worker
block 1
block 2
block 3
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
26
Driver
Worker
Worker
Worker
block 1
block 2
block 3
read
HDFS
block
read
HDFS
block
read
HDFS
block
discussing the other part
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
27
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
process,
cache data
process,
cache data
process,
cache data
discussing the other part
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
28
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
discussing the other part
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
discussing the other part
29
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
30
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
process
from cache
process
from cache
process
from cache
discussing the other part
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
Spark Deconstructed: Log Mining Example
31
Driver
Worker
Worker
Worker
block 1
block 2
block 3
cache 1
cache 2
cache 3
discussing the other part
Looking at the RDD transformations and
actions from another perspective…
action value
RDD
RDD
RDD
transformations RDD
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))

# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()

# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()

# action 2
messages.filter(lambda x: x.find("php") > -1).count()
32
Spark Deconstructed: Log Mining Example
RDD
# base RDD
lines = sc.textFile("/mnt/paco/intro/error_log.txt") \
          .map(lambda x: x.split("\t"))
33
Spark Deconstructed: Log Mining Example
RDD
RDD
RDD
transformations RDD
# transformed RDDs
errors = lines.filter(lambda x: x[0] == "ERROR")
messages = errors.map(lambda x: x[1])

# persistence
messages.cache()
34
Spark Deconstructed: Log Mining Example
action value
RDD
RDD
RDD
transformations RDD
# action 1
messages.filter(lambda x: x.find("mysql") > -1).count()
35
Spark Deconstructed: Log Mining Example
A Brief History
37
A Brief History: Functional Programming for Big Data
circa late 1990s: 

explosive growth of e-commerce and machine data
implied that workloads could not fit on a single
computer anymore…

notable firms led the shift to horizontal scale-out 

on clusters of commodity hardware, especially 

for machine learning use cases at scale
38
A Brief History: Functional Programming for Big Data
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo!
2008 – Hadoop Summit
2010 – Spark paper
2014 – Apache Spark becomes a top-level Apache project
39
circa 2002: 

mitigate risk of large distributed workloads lost 

due to disk failures on commodity hardware…
Google File System 	

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung	

research.google.com/archive/gfs.html	

!
MapReduce: Simplified Data Processing on Large Clusters 	

Jeffrey Dean, Sanjay Ghemawat	

research.google.com/archive/mapreduce.html
A Brief History: MapReduce
A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.

set/list operations in LISP, Prolog, etc., for parallel processing

www-formal.stanford.edu/jmc/history/lisp/lisp.htm	

circa 2004 – Google

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

research.google.com/archive/mapreduce.html	

circa 2006 – Apache

Hadoop, originating from the Nutch Project

Doug Cutting

research.yahoo.com/files/cutting.pdf	

circa 2008 – Yahoo

web scale search indexing

Hadoop Summit, HUG, etc.

developer.yahoo.com/hadoop/	

circa 2009 – Amazon AWS

Elastic MapReduce

Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.

aws.amazon.com/elasticmapreduce/
40
Open Discussion: 	

Enumerate several changes in data center
technologies since 2002…
A Brief History: MapReduce
41
pistoncloud.com/2013/04/storage-
and-the-mobility-gap/
Rich Freitas, IBM Research
A Brief History: MapReduce
meanwhile, spinny
disks haven’t changed
all that much…
storagenewsletter.com/rubriques/hard-
disk-drives/hdd-technology-trends-ibm/
42
MapReduce use cases showed two major
limitations:	

1. difficulty of programming directly in MR

2. performance bottlenecks, or batch not
fitting the use cases	

In short, MR doesn’t compose well for large
applications	

Therefore, people built specialized systems as
workarounds…
A Brief History: MapReduce
43
44
MR doesn’t compose well for large applications, 

and so specialized systems emerged as workarounds
General batch processing: MapReduce

Specialized systems for iterative, interactive, streaming, graph, etc.:
Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become 

one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –

including collection, ETL, storage, exploration and analytics –

should consider Spark for its in-memory performance and

the breadth of its model. It supports advanced analytics

solutions on Hadoop clusters, including the iterative model

required for machine learning and graph analysis.”	

Gartner, Advanced Analytics and Data Science (2014)
45
A Brief History: Spark
46
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
circa 2010: 

a unified engine for enterprise data workflows, 

based on commodity hardware a decade later…
A Brief History: Spark
A Brief History: Spark
Unlike the various specialized systems, Spark’s
goal was to generalize MapReduce to support
new apps within same engine	

Two reasonably small additions are enough to
express the previous models:	

• fast data sharing 	

• general DAGs	

This allows for an approach which is more
efficient for the engine, and much simpler 

for the end users
47
48
A Brief History: Spark
Some key points about Spark:	

• handles batch, interactive, and real-time 

within a single framework	

• native integration with Java, Python, Scala	

• programming at a higher level of abstraction	

• more general: map/reduce is just one set 

of supported constructs
A Brief History: Spark
49
• generalized patterns

unified engine for many use cases	

• lazy evaluation of the lineage graph

reduces wait states, better pipelining	

• generational differences in hardware

off-heap use of large memory spaces	

• functional programming / ease of use

reduction in cost to maintain large apps	

• lower overhead for starting jobs	

• less expensive shuffles
A Brief History: Key distinctions for Spark vs. MapReduce
50
databricks.com/blog/2014/11/05/spark-officially-
sets-a-new-record-in-large-scale-sorting.html
TL;DR: Smashing The Previous Petabyte Sort Record
51
Spark is one of the most active Apache projects
ohloh.net/orgs/apache
52
TL;DR: Sustained Exponential Growth
oreilly.com/data/free/2014-data-science-
salary-survey.csp
TL;DR: Spark Expertise Tops Median Salaries within Big Data
53
databricks.com/blog/2015/01/27/big-data-projects-are-
hungry-for-simpler-and-more-powerful-tools-survey-
validates-apache-spark-is-gaining-developer-traction.html
TL;DR: Spark Survey 2015 by Databricks + Typesafe
54
Ex #3: 

WC, Joins, Shuffles
(diagram: operator graph – RDDs A through E across three stages, combined by map() and join() operations, with one cached partition)
Coding Exercise: WordCount
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");


void reduce (String word, Iterator group):
  int count = 0;

  for each pc in group:
    count += Int(pc);

  emit(word, String(count));
Definition: 	

count how often each word appears 

in a collection of text documents	

This simple program provides a good test case 

for parallel processing, since it:	

• requires a minimal amount of code	

• demonstrates use of both symbolic and 

numeric values	

• isn’t many steps away from search indexing	

• serves as a “Hello World” for Big Data apps

!
A distributed computing framework that can run
WordCount efficiently in parallel at scale 

can likely handle much larger and more interesting
compute problems
count how often each word appears 

in a collection of text documents
56
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
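For comparison, a rough sketch of the Spark side in Scala (the input path is hypothetical):

val f = sc.textFile("/mnt/paco/intro/README.md")                      // hypothetical input
val wc = f.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)  // classic WordCount
wc.collect()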
57
Coding Exercise: WordCount
58
Clone and run /_SparkCamp/02.wc_example

in your folder:
Coding Exercise: WordCount
59
Clone and run /_SparkCamp/03.join_example

in your folder:
Coding Exercise: Join
(diagram: operator graph – RDDs A through E across three stages, combined by map() and join() operations, with one cached partition)
60
Coding Exercise: Join and its Operator Graph
15 min break:
DBC Essentials
(diagram: a typical analytics workflow circa 2010 – data is ETL’d into a cluster/cloud, explored and prepped, features feed learners and parameters, unsupervised learning and train/test sets produce models that are evaluated and optimized, then scored against production data, with visualization and reporting turning results into actionable decisions and feedback – spanning use cases, data pipelines, algorithms, and developers)
63
DBC Essentials: What is Databricks Cloud?
Databricks Platform
Databricks Workspace 
Also see FAQ for more details…
64
DBC Essentials: What is Databricks Cloud?
key concepts
Shard – an instance of Databricks Workspace
Cluster – a Spark cluster (multiple per shard)
Notebook – a list of markdown, executable commands, and results
Dashboard – a flexible space to create operational visualizations
Also see FAQ for more details…
65
DBC Essentials: Notebooks
• Series of commands (think shell++)	

• Each notebook has a language type, 

chosen at notebook creation:	

• Python + SQL	

• Scala + SQL	

• SQL only	

• Command output captured in notebook 	

• Commands can be…	

• edited, reordered, rerun, exported, 

cloned, imported, etc.
66
DBC Essentials: Clusters
• Open source Spark clusters hosted in the cloud	

• Access the Spark UI	

• Attach and Detach notebooks to clusters	

!
NB: our training shards use 7 GB cluster
configurations
67
DBC Essentials: Team, State, Collaboration, Elastic Resources
(diagram: team members log in from browsers to a shard in the cloud; notebook state lives in the shard, notebooks can be attached to or detached from Spark clusters, and local copies move via import/export)
68
DBC Essentials: Team, State, Collaboration, Elastic Resources
Excellent collaboration properties, based
on the use of:	

• comments	

• cloning	

• decoupled state of notebooks vs.
clusters	

• relative independence of code blocks
within a notebook
How to “Think Notebooks”
How to “think” in terms of leveraging notebooks,
based on Computational Thinking:
70
Think Notebooks:
“The way we depict
space has a great
deal to do with how
we behave in it.”

– David Hockney
71
“The impact of computing extends far beyond

science… affecting all aspects of our lives. 

To flourish in today's world, everyone needs

computational thinking.” – CMU	

Computing now ranks alongside the proverbial
Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU

http://www.cs.cmu.edu/~CompThink/

Exploring Computational Thinking @ Google

https://www.google.com/edu/computational-thinking/
Think Notebooks: Computational Thinking
72
Computational Thinking provides a structured
way of conceptualizing the problem… 	

In effect, developing notes for yourself and
your team	

These in turn can become the basis for team
process, software requirements, etc., 	

In other words, conceptualize how to leverage
computing resources at scale to build high-ROI
apps for Big Data
Think Notebooks: Computational Thinking
73
The general approach, in four parts:	

• Decomposition: decompose a complex
problem into smaller solvable problems	

• Pattern Recognition: identify when a 

known approach can be leveraged	

• Abstraction: abstract from those patterns 

into generalizations as strategies	

• Algorithm Design: articulate strategies as
algorithms, i.e. as general recipes for how to
handle complex problems
Think Notebooks: Computational Thinking
How to “think” in terms of leveraging notebooks, 

by the numbers:	

1. create a new notebook	

2. copy the assignment description as markdown	

3. split it into separate code cells	

4. for each step, write your code under the
markdown	

5. run each step and verify your results
74
Think Notebooks:
Let’s assemble the pieces of the previous few 

code examples, using two files:	

/mnt/paco/intro/CHANGES.txt

/mnt/paco/intro/README.md
1. create RDDs to filter each line for the 

keyword Spark	

2. perform a WordCount on each, i.e., so the
results are (K,V) pairs of (keyword, count)	

3. join the two RDDs	

4. how many instances of Spark are there in 

each file?
75
Coding Exercises: Workflow assignment
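One possible sketch of that workflow in Scala (the helper name keywordCount is just for illustration; your notebook may differ):

def keywordCount(path: String) =
  sc.textFile(path)
    .flatMap(_.split(" "))
    .filter(_ == "Spark")
    .map(w => (w, 1))
    .reduceByKey(_ + _)

val changes = keywordCount("/mnt/paco/intro/CHANGES.txt")
val readme  = keywordCount("/mnt/paco/intro/README.md")

// join on the keyword, then inspect the per-file counts
changes.join(readme).collect().foreach(println)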
Tour of Spark API
(diagram: a driver program with a SparkContext works through a cluster manager to run tasks in executors, with caches, on worker nodes)
The essentials of the Spark API in both Scala
and Python…	

/_SparkCamp/05.scala_api
/_SparkCamp/05.python_api
!
Let’s start with the basic concepts, which are
covered in much more detail in the docs:	

spark.apache.org/docs/latest/scala-
programming-guide.html
Spark Essentials:
77
First thing that a Spark program does is create
a SparkContext object, which tells Spark how
to access a cluster	

In the shell for either Scala or Python, this is
the sc variable, which is created automatically	

Other programs must use a constructor to
instantiate a new SparkContext	

Then in turn SparkContext gets used to create
other variables
Spark Essentials: SparkContext
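For example, a standalone app might build one like this (a minimal sketch; the app name and master setting are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")      // hypothetical app name
  .setMaster("local[*]")    // or a cluster URL
val sc = new SparkContext(conf)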
78
sc"
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6ad51e9c
Spark Essentials: SparkContext
sc"
Out[1]: <__main__.RemoteContext at 0x7ff0bfb18a10>
Scala:
Python:
79
The master parameter for a SparkContext
determines which cluster to use
Spark Essentials: Master
master – description
local – run Spark locally with one worker thread (no parallelism)
local[K] – run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT – connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT – connect to a Mesos cluster; PORT depends on config (5050 by default)
80
(diagram: driver program with SparkContext, a cluster manager, and worker nodes running executors with caches and tasks)
spark.apache.org/docs/latest/cluster-
overview.html
Spark Essentials: Master
81
(diagram: driver program with SparkContext, a cluster manager, and worker nodes running executors with caches and tasks)
The driver performs the following:	

1. connects to a cluster manager to allocate
resources across applications	

2. acquires executors on cluster nodes –
processes that run compute tasks and cache data

3. sends app code to the executors	

4. sends tasks for the executors to run
Spark Essentials: Clusters
82
Resilient Distributed Datasets (RDD) are the
primary abstraction in Spark – a fault-tolerant
collection of elements that can be operated on 

in parallel	

There are currently two types: 	

• parallelized collections – take an existing Scala
collection and run functions on it in parallel	

• Hadoop datasets – run functions on each record
of a file in Hadoop distributed file system or any
other storage system supported by Hadoop
Spark Essentials: RDD
83
• two types of operations on RDDs: 

transformations and actions	

• transformations are lazy 

(not computed immediately)	

• the transformed RDD gets recomputed 

when an action is run on it (default)	

• however, an RDD can be persisted into 

storage in memory or disk
Spark Essentials: RDD
84
val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970]
Spark Essentials: RDD
data = [1, 2, 3, 4, 5]
data
Out[2]: [1, 2, 3, 4, 5]

distData = sc.parallelize(data)
distData
Out[3]: ParallelCollectionRDD[24864] at parallelize at PythonRDD.scala:364
Scala:
Python:
85
Spark can create RDDs from any file stored in HDFS
or other storage systems supported by Hadoop, e.g.,
local file system, Amazon S3, Hypertable, HBase, etc.

Spark supports text files, SequenceFiles, and any
other Hadoop InputFormat, and can also take a
directory or a glob (e.g. /data/201404*)
Spark Essentials: RDD
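As a minimal sketch (the path and glob are hypothetical):

val logs = sc.textFile("/data/201404*")   // read a directory or glob of text files
logs.count()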
action value
RDD
RDD
RDD
transformations RDD
86
val distFile = sqlContext.table("readme")"
distFile: org.apache.spark.sql.SchemaRDD = "
SchemaRDD[24971] at RDD at SchemaRDD.scala:108
Spark Essentials: RDD
distFile = sqlContext.table("readme").map(lambda x: x[0])"
distFile"
Out[11]: PythonRDD[24920] at RDD at PythonRDD.scala:43
Scala:
Python:
87
Transformations create a new dataset from 

an existing one	

All transformations in Spark are lazy: they 

do not compute their results right away –
instead they remember the transformations
applied to some base dataset	

• optimize the required calculations	

• recover from lost data partitions
Spark Essentials: Transformations
88
Spark Essentials: Transformations
transformation – description
map(func) – return a new distributed dataset formed by passing each element of the source through a function func
filter(func) – return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
sample(withReplacement, fraction, seed) – sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
union(otherDataset) – return a new dataset that contains the union of the elements in the source dataset and the argument
distinct([numTasks]) – return a new dataset that contains the distinct elements of the source dataset
89
Spark Essentials: Transformations
transformation – description
groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
sortByKey([ascending], [numTasks]) – when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith
cartesian(otherDataset) – when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
90
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Spark Essentials: Transformations
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
Scala:
Python:
distFile is a collection of lines
91
Spark Essentials: Transformations
Scala:
Python:
closures
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
92
Spark Essentials: Transformations
Scala:
Python:
closures
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
93
looking at the output, how would you 

compare results for map() vs. flatMap() ?
Spark Essentials: Actions
action – description
reduce(func) – aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should also be commutative and associative so that it can be computed correctly in parallel
collect() – return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data
count() – return the number of elements in the dataset
first() – return the first element of the dataset – similar to take(1)
take(n) – return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements
takeSample(withReplacement, num, seed) – return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
94
Spark Essentials: Actions
action – description
saveAsTextFile(path) – write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system; Spark will call toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path) – write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS, or any other Hadoop-supported file system; only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
countByKey() – only available on RDDs of type (K, V); returns a `Map` of (K, Int) pairs with the count of each key
foreach(func) – run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
95
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Actions
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
Scala:
Python:
96
Spark can persist (or cache) a dataset in
memory across operations	

spark.apache.org/docs/latest/programming-guide.html#rdd-
persistence	

Each node stores in memory any slices of it
that it computes and reuses them in other
actions on that dataset – often making future
actions more than 10x faster	

The cache is fault-tolerant: if any partition 

of an RDD is lost, it will automatically be
recomputed using the transformations that
originally created it
Spark Essentials: Persistence
97
Spark Essentials: Persistence
storage level – description
MEMORY_ONLY – store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed (this is the default level)
MEMORY_AND_DISK – store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed
MEMORY_ONLY_SER – store RDD as serialized Java objects (one byte array per partition); generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read
MEMORY_AND_DISK_SER – similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
DISK_ONLY – store the RDD partitions only on disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. – same as the levels above, but replicate each partition on two cluster nodes
OFF_HEAP (experimental) – store RDD in serialized format in Tachyon
98
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Persistence
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
Scala:
Python:
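Beyond cache(), an explicit storage level from the table above can be requested via persist() – a minimal sketch, reusing f from the cell above:

import org.apache.spark.storage.StorageLevel

val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.persist(StorageLevel.MEMORY_AND_DISK)   // pick an explicit storage level
words.reduceByKey(_ + _).collect()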
99
Broadcast variables let the programmer keep a
read-only variable cached on each machine
rather than shipping a copy of it with tasks

For example, to give every node a copy of 

a large input dataset efficiently	

Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms
to reduce communication cost
Spark Essentials: Broadcast Variables
100
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
res10: Array[Int] = Array(1, 2, 3)
Spark Essentials: Broadcast Variables
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
Out[15]: [1, 2, 3]
Scala:
Python:
101
Accumulators are variables that can only be
“added” to through an associative operation	

Used to implement counters and sums,
efficiently in parallel	

Spark natively supports accumulators of
numeric value types and standard mutable
collections, and programmers can extend 

for new types	

Only the driver program can read an
accumulator’s value, not the tasks
Spark Essentials: Accumulators
102
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value
res11: Int = 10
Spark Essentials: Accumulators
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value
Out[16]: 10
Scala:
Python:
103
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value
res11: Int = 10
Spark Essentials: Accumulators
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value
Out[16]: 10
Scala:
Python:
driver-side
104
For a deep-dive about broadcast variables
and accumulator usage in Spark, see also:	

Advanced Spark Features

Matei Zaharia, Jun 2012

ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-
zaharia-amp-camp-2012-advanced-spark.pdf
Spark Essentials: Broadcast Variables and Accumulators
105
val pair = (a, b)

pair._1 // => a
pair._2 // => b
Spark Essentials: (K,V) pairs
Scala:
106
Python:
pair = (a, b)

pair[0] # => a
pair[1] # => b
Spark Essentials: API Details
For more details about the Scala API:	

spark.apache.org/docs/latest/api/scala/
index.html#org.apache.spark.package	

For more details about the Python API:	

spark.apache.org/docs/latest/api/python/
107
Spark @Strata
Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/39547
As the Apache Spark userbase grows, the developer community is working 

to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the
enterprise and major improvements in its performance, scalability and
standard libraries. In 2015, we want to make Spark accessible to a wider 

set of users, through new high-level APIs for data science: machine learning
pipelines, data frames, and R language bindings. In addition, we are defining
extension points to let Spark grow as a platform, making it easy to plug in
data sources, algorithms, and external packages. Like all work on Spark, 

these APIs are designed to plug seamlessly into Spark applications, giving
users a unified platform for streaming, batch and interactive data processing.
Matei Zaharia – started the Spark project
at UC Berkeley, currently CTO of Databricks,
Spark VP at Apache, and an assistant professor
at MIT
Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/40701
Join the Spark team for an informal question and
answer session. Several of the Spark committers,
trainers, etc., from Databricks will be on hand to
field a wide range of detailed questions. 	

Even if you don’t have a specific question, join 

in to hear what others are asking!
Databricks Spark Talks @Strata + Hadoop World
Thu Feb 19 10:40am-11:20am

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Lessons from Running Large Scale Spark Workloads

Reynold Xin, Matei Zaharia
Thu Feb 19 4:00pm–4:40pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38518
Spark Streaming - The State of the Union, and Beyond

Tathagata Das
Databricks Spark Talks @Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Tuning and Debugging in Apache Spark

Patrick Wendell
Fri Feb 20 4:00pm–4:40pm

strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38391
Everyday I’m Shuffling - Tips for Writing Better Spark Programs

Vida Ha, Holden Karau
Spark Developer Certification

Fri Feb 20, 2015 10:40am-12:40pm
• go.databricks.com/spark-certified-developer
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
• 40 multiple-choice questions, 90 minutes	

• mostly structured as choices among code blocks	

• expect some Python, Java, Scala, SQL	

• understand theory of operation	

• identify best practices	

• recognize code that is more parallel, less
memory constrained	

!
Overall, you need to write Spark apps in practice
Developer Certification: Overview
114
60 min lunch:
Intro to MLlib
Songs Demo
Spark SQL
blurs the lines between RDDs and relational tables	

spark.apache.org/docs/latest/sql-programming-
guide.html	

!
intermix SQL commands to query external data,
along with complex analytics, in a single app:	

• allows SQL extensions based on MLlib	

• provides the “heavy lifting” for ETL in DBC
Spark SQL: Manipulating Structured Data Using Spark

Michael Armbrust, Reynold Xin (2014-03-24)

databricks.com/blog/2014/03/26/Spark-SQL-
manipulating-structured-data-using-Spark.html
119
Spark SQL: Data Workflows
Parquet is a columnar format, supported 

by many different Big Data frameworks	

http://parquet.io/	

Spark SQL supports read/write of parquet files, 

automatically preserving the schema of the 

original data (HUGE benefits)	

Modifying the previous example…
120
Spark SQL: Data Workflows – Parquet
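A minimal sketch of that round trip with the Spark 1.x SchemaRDD API (the output path here is hypothetical):

val readme = sqlContext.table("readme")
readme.saveAsParquetFile("/tmp/readme.parquet")            // hypothetical output path

val parquetData = sqlContext.parquetFile("/tmp/readme.parquet")
parquetData.registerTempTable("readme_parquet")
sqlContext.sql("SELECT * FROM readme_parquet LIMIT 10").collect()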
121
Demo of /_SparkCamp/demo_sql_scala 

by the instructor:
Spark SQL: SQL Demo
122
Next, we’ll review the following sections in the

Databricks Guide:	

/databricks_guide/Importing Data
/databricks_guide/Databricks File System
!
Key Topics:	

• JSON, CSV, Parquet	

• S3, Hive, Redshift	

• DBFS, dbutils
Spark SQL: Using DBFS
Visualization
124
For any SQL query, you can show the results 

as a table, or generate a plot with a 

single click…
Visualization: Built-in Plots
125
Several of the plot types have additional options 

to customize the graphs they generate…
Visualization: Plot Options
126
For example, series groupings can be used to help
organize bar charts…
Visualization: Series Groupings
127
See /databricks-guide/05 Visualizations 

for details about built-in visualizations and
extensions…
Visualization: Reference Guide
128
The display() command:	

• programmatic access to visualizations	

• pass a SchemaRDD to print as an HTML table	

• pass a Scala list to print as an HTML table	

• call without arguments to display matplotlib
figures
Visualization: Using display()
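For instance, in a Databricks notebook a single call renders a table (a minimal sketch, reusing the readme table from earlier):

display(sqlContext.table("readme"))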
129
The displayHTML() command:	

• render any arbitrary HTML/JavaScript	

• include JavaScript libraries (advanced feature)	

• paste in D3 examples to get a sense for this…
Visualization: Using displayHTML()
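A tiny sketch of the idea (the HTML string is arbitrary):

displayHTML("<h3>Hello from displayHTML()</h3>")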
130
Clone the entire folder /_SparkCamp/Viz D3

into your folder and run its notebooks:
Demo: D3Visualization
131
Clone and run /_SparkCamp/07.sql_visualization

in your folder:
Coding Exercise: SQL + Visualization
15 min break:
Spark Streaming
Let’s consider the top-level requirements for 

a streaming framework:	

• clusters scalable to 100’s of nodes	

• low-latency, in the range of seconds

(meets 90% of use case needs)	

• efficient recovery from failures

(which is a hard problem in CS)	

• integrates with batch: many co’s run the 

same business logic both online+offline
Spark Streaming: Requirements
134
Therefore, run a streaming computation as: 

a series of very small, deterministic batch jobs	

!
• Chop up the live stream into 

batches of X seconds 	

• Spark treats each batch of 

data as RDDs and processes 

them using RDD operations	

• Finally, the processed results 

of the RDD operations are 

returned in batches
Spark Streaming: Requirements
135
Therefore, run a streaming computation as: 

a series of very small, deterministic batch jobs	

!
• Batch sizes as low as ½ sec, 

latency of about 1 sec	

• Potential for combining 

batch processing and 

streaming processing in 

the same system
Spark Streaming: Requirements
136
Data can be ingested from many sources: 

Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc.

Results can be pushed out to filesystems,
databases, live dashboards, etc.	

Spark’s built-in machine learning algorithms and
graph processing algorithms can be applied to
data streams
Spark Streaming: Integration
137
2012 – project started

2013 – alpha release (Spark 0.7)

2014 – graduated (Spark 0.9)
Spark Streaming: Timeline
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica

Berkeley EECS (2012-12-14)

www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
project lead: 

Tathagata Das @tathadas
138
Typical kinds of applications:	

• datacenter operations	

• web app funnel metrics	

• ad optimization	

• anti-fraud	

• telecom	

• video analytics	

• various telematics	

and much much more!
Spark Streaming: Use Cases
139
Programming Guide

spark.apache.org/docs/latest/streaming-
programming-guide.html	

TD @ Spark Summit 2014

youtu.be/o-NXwFrNAWQ?list=PLTPXxbhUt-
YWGNTaDj6HSjnHMxiTD1HCR	

“Deep Dive into Spark Streaming”

slideshare.net/spark-project/deep-
divewithsparkstreaming-
tathagatadassparkmeetup20130617	

Spark Reference Applications

databricks.gitbooks.io/databricks-spark-
reference-applications/
Spark Streaming: Some Excellent Resources
140
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()
ssc.awaitTermination()
Quiz: name the bits and pieces…
141
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PyStreamNWC", master="local[*]")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.pprint()

ssc.start()
ssc.awaitTermination()
Demo: PySpark Streaming NetworkWord Count
142
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)

sc = SparkContext(appName="PyStreamNWC", master="local[*]")
ssc = StreamingContext(sc, 5)  # 5-second batch interval
ssc.checkpoint("checkpoint")

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .updateStateByKey(updateFunc) \
              .transform(lambda x: x.sortByKey())

counts.pprint()

ssc.start()
ssc.awaitTermination()
Demo: PySpark Streaming NetworkWord Count - Stateful
143
Ref Apps:
Kinesis Demo

Twitter Streaming Demo
145
databricks.gitbooks.io/databricks-spark-reference-applications/
content/twitter_classifier/README.html
Demo: Twitter Streaming Language Classifier
146
(diagram: Streaming collects tweets from the Twitter API into an HDFS dataset; Spark SQL handles ETL and queries; Spark featurizes the text; MLlib trains a classifier whose model is stored in HDFS; Streaming then scores incoming tweets and applies a language filter)
Demo: Twitter Streaming Language Classifier
147
1. extract text from the tweet – https://twitter.com/andy_bf/status/16222269370011648 → "Ceci n'est pas un tweet"
2. sequence text as bigrams – tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. convert bigrams into numbers – seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. index into sparse tf vector – seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. increment feature count – Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
Demo: Twitter Streaming Language Classifier
From tweets to ML features,
approximated as sparse
vectors:
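A rough sketch of those steps in Scala using MLlib sparse vectors (the 1000-bucket hashing and the normalization here are illustrative, following the table above):

import org.apache.spark.mllib.linalg.Vectors

val tweet = "Ceci n'est pas un tweet"
val bigrams = tweet.sliding(2).toSeq                          // character bigrams
val indices = bigrams.map(s => math.abs(s.hashCode) % 1000)   // hash each bigram into 1000 buckets
val tf = indices.groupBy(identity).mapValues(_.size.toDouble / bigrams.size)  // term frequency per bucket
val features = Vectors.sparse(1000, tf.toSeq)                 // sparse feature vector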
Building a Scala JAR
SBT is the Simple Build Tool for Scala:	

www.scala-sbt.org/	

This is included with the Spark download, and 

does not need to be installed separately.	

Similar to Maven, however it provides for
incremental compilation and an interactive shell, 

among other innovations.	

SBT project uses StackOverflow for Q&A, 

that’s a good resource to study further:	

stackoverflow.com/tags/sbt
Spark in Production: Build: SBT
149
Spark in Production: Build: SBT
command – description
clean – delete all generated files (in the target directory)
package – create a JAR file
run – run the JAR (or main class, if named)
compile – compile the main sources (in the src/main/scala and src/main/java directories)
test – compile and run all tests
console – launch a Scala interpreter
help – display detailed help for specified commands
150
builds:	

• build/run a JAR using Java + Maven	

• SBT primer	

• build/run a JAR using Scala + SBT
Spark in Production: Build: Scala
151
The following sequence shows how to build 

a JAR file from a Scala app, using SBT	

• First, this requires the “source” download,
not the “binary”	

• cd into the SPARK_HOME directory

• Then run the following commands…
Spark in Production: Build: Scala
152
# Scala source + SBT build script on following slides

cd simple-app

../sbt/sbt -Dsbt.ivy.home=../sbt/ivy package

../spark/bin/spark-submit \
  --class "SimpleApp" \
  --master local[*] \
  target/scala-2.10/simple-project_2.10-1.0.jar
Spark in Production: Build: Scala
153
/*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Spark in Production: Build: Scala
154
name := "Simple Project""
!
version := "1.0""
!
scalaVersion := "2.10.4""
!
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0""
!
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Spark in Production: Build: Scala
155
Deploying an App
Spark in Production: Databricks Cloud
157
Databricks Platform
Databricks Workspace 
Arguably, one of the simplest ways to deploy Apache Spark
is to use Databricks Cloud for cloud-based notebooks
DataStax Enterprise 4.6:	

www.datastax.com/docs/latest-dse/	

datastax.com/dev/blog/interactive-advanced-analytic-with-dse-
and-spark-mllib	

• integration of Cassandra (elastic storage) and 

Spark (elastic compute)	

• github.com/datastax/spark-cassandra-connector	

• exposes Cassandra tables as Spark RDDs	

• converts data types between Cassandra and Scala	

• allows for execution of arbitrary CQL statements	

• plays nice with Cassandra Virtual Nodes
Spark in Production: DSE
158
Apache Mesos, from which Apache Spark 

originated…	

Running Spark on Mesos

spark.apache.org/docs/latest/running-on-mesos.html 	

Run Apache Spark on Apache Mesos

tutorial based on Mesosphere + Google Cloud

ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html	

Getting Started Running Apache Spark on Apache Mesos

O’Reilly Media webcast

oreilly.com/pub/e/2986
Spark in Production: Mesos
159
Cloudera Manager 4.8.x:	

cloudera.com/content/cloudera-content/cloudera-
docs/CM4Ent/latest/Cloudera-Manager-Installation-
Guide/cmig_spark_installation_standalone.html	

• 5 steps to install the Spark parcel	

• 5 steps to configure and start the Spark service	

Also check out Cloudera Live:	

cloudera.com/content/cloudera/en/products-and-
services/cloudera-live.html
Spark in Production: CM
160
MapR Technologies provides support for running 

Spark on the MapR distros:	

mapr.com/products/apache-spark	

slideshare.net/MapRTechnologies/map-r-
databricks-webinar-4x3
Spark in Production: MapR
161
Hortonworks provides support for running 

Spark on HDP:	

spark.apache.org/docs/latest/hadoop-third-party-
distributions.html	

hortonworks.com/blog/announcing-hdp-2-1-tech-
preview-component-apache-spark/
Spark in Production: HDP
162
Running Spark on Amazon AWS EC2:	

blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/
Installing-Apache-Spark-on-an-Amazon-EMR-Cluster
Spark in Production: EC2
163
Spark in MapReduce (SIMR) – quick way 

for Hadoop MR1 users to deploy Spark:	

databricks.github.io/simr/	

spark-summit.org/talk/reddy-simr-let-your-
spark-jobs-simmer-inside-hadoop-clusters/	

• Spark runs on Hadoop clusters without 

any install or required admin rights

• SIMR launches a Hadoop job that only 

contains mappers, includes Scala+Spark	

./simr jar_file main_class parameters 

[--outdir=] [--slots=N] [--unique]
Spark in Production: SIMR
164
review UI features

spark.apache.org/docs/latest/monitoring.html

http://<master>:8080/

http://<master>:50070/

• verify: is my job still running?	

• drill-down into workers and stages	

• examine stdout and stderr	

• discuss how to diagnose / troubleshoot
Spark in Production: Monitor
165
Spark in Production: Monitor – Spark Console
166
Spark in Production: Monitor – AWS Console
167
GraphX examples
(diagram: example graph with nodes 0–3 connected by edges with costs 1, 1, 2, 3, and 4)
169
GraphX:
spark.apache.org/docs/latest/graphx-
programming-guide.html	

!
Key Points:	

!
• graph-parallel systems	

• importance of workflows	

• optimizations
170
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin

graphlab.org/files/osdi2012-gonzalez-low-gu-
bickson-guestrin.pdf	

Pregel: Large-scale graph computing at Google

Grzegorz Czajkowski, et al.

googleresearch.blogspot.com/2009/06/large-scale-
graph-computing-at-google.html	

GraphX: Unified Graph Analytics on Spark

Ankur Dave, Databricks

databricks-training.s3.amazonaws.com/slides/
graphx@sparksummit_2014-07.pdf	

Advanced Exercises: GraphX

databricks-training.s3.amazonaws.com/graph-
analytics-with-graphx.html
GraphX: Further Reading…
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
171
GraphX: Example – simple traversals
172
GraphX: Example – routing problems
(diagram: example graph with nodes 0–3 connected by edges with costs 1, 1, 2, 3, and 4)
What is the cost to reach node 0 from any other
node in the graph? This is a common use case for
graph algorithms, e.g., Dijkstra’s shortest paths
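As a sketch of how that looks with GraphX’s Pregel operator (reusing the Peep graph g from the earlier example and treating its Int edge attributes as costs; the source vertex is arbitrary):

import org.apache.spark.graphx._

// convert the Int edge attributes to Double costs
val weighted = g.mapEdges(e => e.attr.toDouble)

val sourceId: VertexId = 1L   // hypothetical source vertex
val initialGraph = weighted.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the best distance
  triplet => {                                      // send messages along cheaper paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                          // merge messages
)

sssp.vertices.collect.foreach(println)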
173
Clone and run /_SparkCamp/08.graphx

in your folder:
GraphX: Coding Exercise
Case Studies
Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments
for Apache Spark can be found at:	

https://cwiki.apache.org/confluence/display/
SPARK/Powered+By+Spark	

https://databricks.com/blog/category/company/
partners	

http://go.databricks.com/customer-case-studies
175
Case Studies: Automatic Labs
176
Spark Plugs Into Your Car

Rob Ferguson

spark-summit.org/east/2015/talk/spark-plugs-into-your-car	

finance.yahoo.com/news/automatic-labs-turns-databricks-
cloud-140000785.html	

Automatic creates personalized driving habit dashboards	

• wanted to use Spark while minimizing investment in DevOps	

• provides data access to non-technical analysts via SQL	

• replaced Redshift and disparate ML tools with single platform 	

• leveraged built-in visualization capabilities in notebooks to
generate dashboards easily and quickly	

• used MLlib on Spark for needed functionality out of the box
Spark atTwitter: Evaluation & Lessons Learnt

Sriram Krishnan

slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter	

• Spark can be more interactive, efficient than MR 	

• support for iterative algorithms and caching 	

• more generic than traditional MapReduce	

• Why is Spark faster than Hadoop MapReduce? 	

• fewer I/O synchronization barriers 	

• less expensive shuffle	

• the more complex the DAG, the greater the 

performance improvement
177
Case Studies: Twitter
Pearson uses Spark Streaming for next
generation adaptive learning platform	

Dibyendu Bhattacharya	

databricks.com/blog/2014/12/08/pearson-
uses-spark-streaming-for-next-generation-
adaptive-learning-platform.html
178
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN
cluster	

• single platform/common API was a key reason to replace
Storm with Spark Streaming	

• custom Kafka Consumer for Spark Streaming, using Low
Level Kafka Consumer APIs	

• handles: Kafka node failures, receiver failures, leader
changes, committed offset in ZK, tunable data rate
throughput
Case Studies: Pearson
Unlocking Your Hadoop Data with Apache Spark and CDH5

Denny Lee	

slideshare.net/Concur/unlocking-your-hadoop-data-
with-apache-spark-and-cdh5
179
• leading provider of spend management solutions and
services	

• delivers recommendations based on business users’ travel
and expenses – “to help deliver the perfect trip”	

• use of traditional BI tools with Spark SQL allowed analysts
to make sense of the data without becoming programmers	

• needed the ability to transition quickly between Machine
Learning (MLLib), Graph (GraphX), and SQL usage	

• needed to deliver recommendations in real-time
Case Studies: Concur
Stratio Streaming: a new approach to Spark Streaming	

David Morales, Oscar Mendez	

spark-summit.org/2014/talk/stratio-streaming-a-
new-approach-to-spark-streaming
180
• Stratio Streaming is the union of a real-time messaging bus
with a complex event processing engine atop Spark
Streaming	

• allows the creation of streams and queries on the fly	

• paired with Siddhi CEP engine and Apache Kafka	

• added global features to the engine such as auditing and
statistics
Case Studies: Stratio
Collaborative Filtering with Spark

Chris Johnson

slideshare.net/MrChrisJohnson/collaborative-filtering-with-
spark	

• collab filter (ALS) for music recommendation	

• Hadoop suffers from I/O overhead	

• show a progression of code rewrites, converting a 

Hadoop-based app into efficient use of Spark
181
Case Studies: Spotify
Guavus Embeds Apache Spark 

into its Operational Intelligence Platform 

Deployed at the World’s Largest Telcos

Eric Carr	

databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-
into-its-operational-intelligence-platform-deployed-at-the-
worlds-largest-telcos.html
182
• 4 of 5 top mobile network operators, 3 of 5 top Internet
backbone providers, 80% MSOs in NorAm	

• analyzing 50% of US mobile data traffic, +2.5 PB/day	

• latency is critical for resolving operational issues before
they cascade: 2.5 MM transactions per second	

• “analyze first” not “store first, ask questions later”
Case Studies: Guavus
Case Studies: Radius Intelligence
183
From Hadoop to Spark in 4 months, Lessons Learned

Alexis Roos

http://youtu.be/o3-lokUFqvA
• building a full SMB index took 12+ hours using Hadoop and Cascading	

• pipeline was difficult to modify/enhance	

• Spark increased pipeline performance 10x 	

• interactive shell and notebooks enabled data scientists to experiment and develop code faster	

• PMs and business development staff can use SQL to query large data sets
Further Resources +

Q&A
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
Further Resources: Spark Packages
186
Looking for other libraries and features? There
are a variety of third-party packages available at:	

http://spark-packages.org/
Further Resources: DBC Feedback
187
Other feedback, suggestions, etc.?	

http://feedback.databricks.com/
MOOCs:
Anthony Joseph

UC Berkeley	

begins Apr 2015	

edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar

UCLA	

begins Q2 2015	

edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
confs:
Spark Summit East

NYC, Mar 18-19

spark-summit.org/east
QCon SP

São Paulo, Brazil, Mar 23-27

qconsp.com
Big Data Tech Con

Boston, Apr 26-28

bigdatatechcon.com
Strata EU

London, May 5-7

strataconf.com/big-data-conference-uk-2015
GOTO Chicago

Chicago, May 11-14

gotocon.com/chicago-2015
Spark Summit 2015

SF, Jun 15-17

spark-summit.org
190
http://spark-summit.org/

15% discount: SparkItUpEast15
books:
Fast Data Processing 

with Spark

Holden Karau

Packt (2013)

shop.oreilly.com/product/
9781782167068.do
Spark in Action

Chris Fregly

Manning (2015)

sparkinaction.com/
Learning Spark

Holden Karau, 

Andy Konwinski,
Matei Zaharia

O’Reilly (2015)

shop.oreilly.com/product/
0636920028512.do
About Databricks
• Founded by the creators of Spark in 2013	

• Largest organization contributing to Spark	

• End-to-end hosted service, Databricks Cloud	

• http://databricks.com/

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials

  • 1. Spark Camp @ Strata CA Intro to Apache Spark with Hands-on Tutorials
 Wed Feb 18, 2015 9:00am–5:00pm download slides:
 training.databricks.com/workshop/sparkcamp.pdf Licensed under a Creative Commons Attribution-NonCommercial- NoDerivatives 4.0 International License
  • 2. Tutorial Outline: morning afternoon Welcome + Getting Started Intro to MLlib Ex 1: Pre-Flight Check Songs Demo How Spark Runs on a Cluster Spark SQL A Brief History Visualizations Ex 3: WC, Joins Ex 7: SQL + Visualizations DBC Essentials Spark Streaming How to “Think Notebooks” Tw Streaming / Kinesis Demo Ex 4: Workflow exercise Building a Scala JAR Tour of Spark API Deploying Apps Spark @ Strata Ex 8: GraphX examples Case Studies Further Resources / Q&A
  • 3. Introducing: Reza Zadeh
 @Reza_Zadeh Hossein Falaki
 @mhfalaki Andy Konwinski
 @andykonwinski Chris Fregly
 @cfregly Denny Lee
 @dennylee Holden Karau
 @holdenkarau Krishna Sankar
 @ksankar Paco Nathan
 @pacoid
  • 5. Everyone will receive a username/password for one 
 of the Databricks Cloud shards. Use your laptop and browser to login there. We find that cloud-based notebooks are a simple way to get started using Apache Spark – as the motto “Making Big Data Simple” states. Please create and run a variety of notebooks on your account throughout the tutorial. These accounts will remain open long enough for you to export your work. See the product page or FAQ for more details, or contact Databricks to register for a trial account. 5 Getting Started: Step 1
  • 6. 6 Getting Started: Step 1 – Credentials url:! ! https://class01.cloud.databricks.com/! user:! ! student-777! pass:! ! 93ac11xq23z5150! cluster:! student-777!
  • 7. 7 Open in a browser window, then click on the navigation menu in the top/left corner: Getting Started: Step 2
  • 8. 8 The next columns to the right show folders,
 and scroll down to click on databricks_guide Getting Started: Step 3
  • 9. 9 Scroll to open the 01 Quick Start notebook, then follow the discussion about using key features: Getting Started: Step 4
  • 10. 10 See /databricks-guide/01 Quick Start 
 Key Features: • Workspace / Folder / Notebook • Code Cells, run/edit/move/comment • Markdown • Results • Import/Export Getting Started: Step 5
  • 11. 11 Click on the Workspace menu and create your 
 own folder (pick a name): Getting Started: Step 6
  • 12. 12 Navigate to /_SparkCamp/00.pre-flight-check
 hover on its drop-down menu, on the right side: Getting Started: Step 7
  • 13. 13 Then create a clone of this notebook in the folder that you just created: Getting Started: Step 8
  • 14. 14 Attach your cluster – same as your username: Getting Started: Step 9
  • 15. 15 Now let’s get started with the coding exercise! We’ll define an initial Spark app in three lines 
 of code: Getting Started: Coding Exercise
  • 16. If you’re new to this Scala thing and want to spend a few minutes on the basics… 16 Scala Crash Course
 Holden Karau
 lintool.github.io/SparkTutorial/ slides/day1_Scala_crash_course.pdf Getting Started: Bonus!
  • 17. 17 Getting Started: Extra Bonus!! See also the /learning_spark_book for all of its code examples in notebooks:
  • 18. How Spark runs 
 on a Cluster Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3
  • 19. 19 Clone and run /_SparkCamp/01.log_example
 in your folder: Spark Deconstructed: Log Mining Example
  • 20. Spark Deconstructed: Log Mining Example 20 # load error messages from a log into memory! # then interactively search for patterns!  ! # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count()
  • 21. Spark Deconstructed: Log Mining Example x = messages.filter(lambda x: x.find("mysql") > -1)! x.toDebugString()! ! ! (2) PythonRDD[772] at RDD at PythonRDD.scala:43 []! | PythonRDD[219] at RDD at PythonRDD.scala:43 []! | error_log.txt MappedRDD[218] at NativeMethodAccessorImpl.java:-2 []! | error_log.txt HadoopRDD[217] at NativeMethodAccessorImpl.java:-2 [] Note that we can examine the operator graph for a transformed RDD, for example: 21
  • 22. Driver Worker Worker Worker Spark Deconstructed: Log Mining Example We start with Spark running on a cluster…
 submitting code to be evaluated on it: 22
  • 23. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example discussing the other part 23 Driver Worker Worker Worker
  • 24. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 24 Driver Worker Worker Worker block 1 block 2 block 3 discussing the other part
  • 25. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example discussing the other part 25 Driver Worker Worker Worker block 1 block 2 block 3
  • 26. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 26 Driver Worker Worker Worker block 1 block 2 block 3 read HDFS block read HDFS block read HDFS block discussing the other part
  • 27. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 27 Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3 process, cache data process, cache data process, cache data discussing the other part
  • 28. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 28 Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3 discussing the other part
  • 29. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example discussing the other part 29 Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3
  • 30. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 30 Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3 process from cache process from cache process from cache discussing the other part
  • 31. # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() Spark Deconstructed: Log Mining Example 31 Driver Worker Worker Worker block 1 block 2 block 3 cache 1 cache 2 cache 3 discussing the other part
  • 32. Looking at the RDD transformations and actions from another perspective… action value RDD RDD RDD transformations RDD # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t"))!  ! # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])!  ! # persistence! messages.cache()!  ! # action 1! messages.filter(lambda x: x.find("mysql") > -1).count()!  ! # action 2! messages.filter(lambda x: x.find("php") > -1).count() 32 Spark Deconstructed: Log Mining Example
  • 33. RDD # base RDD! lines = sc.textFile("/mnt/paco/intro/error_log.txt") ! .map(lambda x: x.split("t")) 33 Spark Deconstructed: Log Mining Example
  • 34. RDD RDD RDD transformations RDD # transformed RDDs! errors = lines.filter(lambda x: x[0] == "ERROR")! messages = errors.map(lambda x: x[1])! ! # persistence! messages.cache() 34 Spark Deconstructed: Log Mining Example
  • 35. action value RDD RDD RDD transformations RDD # action 1! messages.filter(lambda x: x.find("mysql") > -1).count() 35 Spark Deconstructed: Log Mining Example
  • 37. 37 A Brief History: Functional Programming for Big Data circa late 1990s: 
 explosive growth e-commerce and machine data implied that workloads could not fit on a single computer anymore… notable firms led the shift to horizontal scale-out 
 on clusters of commodity hardware, especially 
 for machine learning use cases at scale
  • 38. 38 A Brief History: Functional Programming for Big Data 2002 2002 MapReduce @ Google 2004 MapReduce paper 2006 Hadoop @Yahoo! 2004 2006 2008 2010 2012 2014 2014 Apache Spark top-level 2010 Spark paper 2008 Hadoop Summit
  • 39. 39 circa 2002: 
 mitigate risk of large distributed workloads lost 
 due to disk failures on commodity hardware… Google File System Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung research.google.com/archive/gfs.html ! MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean, Sanjay Ghemawat research.google.com/archive/mapreduce.html A Brief History: MapReduce
  • 40. A Brief History: MapReduce circa 1979 – Stanford, MIT, CMU, etc.
 set/list operations in LISP, Prolog, etc., for parallel processing
 www-formal.stanford.edu/jmc/history/lisp/lisp.htm circa 2004 – Google
 MapReduce: Simplified Data Processing on Large Clusters
 Jeffrey Dean and Sanjay Ghemawat
 research.google.com/archive/mapreduce.html circa 2006 – Apache
 Hadoop, originating from the Nutch Project
 Doug Cutting
 research.yahoo.com/files/cutting.pdf circa 2008 – Yahoo
 web scale search indexing
 Hadoop Summit, HUG, etc.
 developer.yahoo.com/hadoop/ circa 2009 – Amazon AWS
 Elastic MapReduce
 Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
 aws.amazon.com/elasticmapreduce/ 40
  • 41. Open Discussion: Enumerate several changes in data center technologies since 2002… A Brief History: MapReduce 41
  • 42. pistoncloud.com/2013/04/storage- and-the-mobility-gap/ Rich Freitas, IBM Research A Brief History: MapReduce meanwhile, spinny disks haven’t changed all that much… storagenewsletter.com/rubriques/hard- disk-drives/hdd-technology-trends-ibm/ 42
  • 43. MapReduce use cases showed two major limitations: 1. difficultly of programming directly in MR 2. performance bottlenecks, or batch not fitting the use cases In short, MR doesn’t compose well for large applications Therefore, people built specialized systems as workarounds… A Brief History: MapReduce 43
  • 44. 44 MR doesn’t compose well for large applications, 
 and so specialized systems emerged as workarounds MapReduce General Batch Processing Specialized Systems: iterative, interactive, streaming, graph, etc. Pregel Giraph Dremel Drill Tez Impala GraphLab StormS4 F1 MillWheel A Brief History: MapReduce
  • 45. Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become 
 one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations spark.apache.org “Organizations that are looking at big data challenges –
 including collection, ETL, storage, exploration and analytics –
 should consider Spark for its in-memory performance and
 the breadth of its model. It supports advanced analytics
 solutions on Hadoop clusters, including the iterative model
 required for machine learning and graph analysis.” Gartner, Advanced Analytics and Data Science (2014) 45 A Brief History: Spark
  • 46. 46 Spark: Cluster Computing withWorking Sets Matei Zaharia, Mosharaf Chowdhury, 
 Michael Franklin, Scott Shenker, Ion Stoica
 people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf ! Resilient Distributed Datasets:A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury,Tathagata Das,Ankur Dave, 
 Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf circa 2010: 
 a unified engine for enterprise data workflows, 
 based on commodity hardware a decade later… A Brief History: Spark
  • 47. A Brief History: Spark Unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within same engine Two reasonably small additions are enough to express the previous models: • fast data sharing • general DAGs This allows for an approach which is more efficient for the engine, and much simpler 
 for the end users 47
  • 49. Some key points about Spark: • handles batch, interactive, and real-time 
 within a single framework • native integration with Java, Python, Scala • programming at a higher level of abstraction • more general: map/reduce is just one set 
 of supported constructs A Brief History: Spark 49
  • 50. • generalized patterns
 unified engine for many use cases • lazy evaluation of the lineage graph
 reduces wait states, better pipelining • generational differences in hardware
 off-heap use of large memory spaces • functional programming / ease of use
 reduction in cost to maintain large apps • lower overhead for starting jobs • less expensive shuffles A Brief History: Key distinctions for Spark vs. MapReduce 50
  • 52. Spark is one of the most active Apache projects ohloh.net/orgs/apache 52 TL;DR: Sustained Exponential Growth
  • 55. Ex #3: 
 WC, Joins, Shuffles A: stage 1 B: C: stage 2 D: stage 3 E: map() map() map() map() join() cached partition RDD
  • 56. Coding Exercise: WordCount void map (String doc_id, String text):! for each word w in segment(text):! emit(w, "1");! ! ! void reduce (String word, Iterator group):! int count = 0;! ! for each pc in group:! count += Int(pc);! ! emit(word, String(count)); Definition: count how often each word appears 
 in a collection of text documents This simple program provides a good test case 
 for parallel processing, since it: • requires a minimal amount of code • demonstrates use of both symbolic and 
 numeric values • isn’t many steps away from search indexing • serves as a “HelloWorld” for Big Data apps ! A distributed computing framework that can run WordCount efficiently in parallel at scale 
 can likely handle much larger and more interesting compute problems count how often each word appears 
 in a collection of text documents 56
  • 57. WordCount in 3 lines of Spark WordCount in 50+ lines of Java MR 57 Coding Exercise: WordCount
  • 58. 58 Clone and run /_SparkCamp/02.wc_example
 in your folder: Coding Exercise: WordCount
  • 59. 59 Clone and run /_SparkCamp/03.join_example
 in your folder: Coding Exercise: Join
  • 60. A: stage 1 B: C: stage 2 D: stage 3 E: map() map() map() map() join() cached partition RDD 60 Coding Exercise: Join and its Operator Graph
  • 62. DBC Essentials evaluation optimization representation circa 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable resultsdecisions, feedback bar developers foo algorithms
  • 63. 63 DBC Essentials: What is Databricks Cloud? Databricks Platform Databricks Workspace Also see FAQ for more details…
  • 64. 64 DBC Essentials: What is Databricks Cloud? key concepts Shard an instance of Databricks Workspace Cluster a Spark cluster (multiple per shard) Notebook a list of markdown, executable commands, and results Dashboard a flexible space to create operational visualizations Also see FAQ for more details…
  • 65. 65 DBC Essentials: Notebooks • Series of commands (think shell++) • Each notebook has a language type, 
 chosen at notebook creation: • Python + SQL • Scala + SQL • SQL only • Command output captured in notebook • Commands can be… • edited, reordered, rerun, exported, 
 cloned, imported, etc.
  • 66. 66 DBC Essentials: Clusters • Open source Spark clusters hosted in the cloud • Access the Spark UI • Attach and Detach notebooks to clusters ! NB: our training shards use 7 GB cluster configurations
  • 67. 67 DBC Essentials: Team, State, Collaboration, Elastic Resources Cloud login state attached Spark cluster Shard Notebook Spark cluster detached Browser team Browserlogin import/ export Local Copies
  • 68. 68 DBC Essentials: Team, State, Collaboration, Elastic Resources Excellent collaboration properties, based on the use of: • comments • cloning • decoupled state of notebooks vs. clusters • relative independence of code blocks within a notebook
  • 69. How to “Think Notebooks”
  • 70. How to “think” in terms of leveraging notebooks, based on Computational Thinking: 70 Think Notebooks: “The way we depict space has a great deal to do with how we behave in it.”
 – David Hockney
  • 71. 71 “The impact of computing extends far beyond
 science… affecting all aspects of our lives. 
 To flourish in today's world, everyone needs
 computational thinking.” – CMU Computing now ranks alongside the proverbial Reading,Writing, and Arithmetic… Center for ComputationalThinking @ CMU
 http://www.cs.cmu.edu/~CompThink/ Exploring ComputationalThinking @ Google
 https://www.google.com/edu/computational-thinking/ Think Notebooks: ComputationalThinking
  • 72. 72 Computational Thinking provides a structured way of conceptualizing the problem… In effect, developing notes for yourself and your team These in turn can become the basis for team process, software requirements, etc., In other words, conceptualize how to leverage computing resources at scale to build high-ROI apps for Big Data Think Notebooks: ComputationalThinking
  • 73. 73 The general approach, in four parts: • Decomposition: decompose a complex problem into smaller solvable problems • Pattern Recognition: identify when a 
 known approach can be leveraged • Abstraction: abstract from those patterns 
 into generalizations as strategies • Algorithm Design: articulate strategies as algorithms, i.e. as general recipes for how to handle complex problems Think Notebooks: Computational Thinking
  • 74. How to “think” in terms of leveraging notebooks, 
 by the numbers: 1. create a new notebook 2. copy the assignment description as markdown 3. split it into separate code cells 4. for each step, write your code under the markdown 5. run each step and verify your results 74 Think Notebooks:
  • 75. Let’s assemble the pieces of the previous few 
 code examples, using two files: /mnt/paco/intro/CHANGES.txt
 /mnt/paco/intro/README.md" 1. create RDDs to filter each line for the 
 keyword Spark 2. perform a WordCount on each, i.e., so the results are (K,V) pairs of (keyword, count) 3. join the two RDDs 4. how many instances of Spark are there in 
 each file? 75 Coding Exercises: Workflow assignment
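One minimal sketch of this workflow in Scala (one possible reading of the assignment, assuming sc is the notebook’s SparkContext and both paths above are readable):

// load the two files as RDDs of lines
val changes = sc.textFile("/mnt/paco/intro/CHANGES.txt")
val readme  = sc.textFile("/mnt/paco/intro/README.md")

// steps 1–2: keep only the keyword "Spark" and count its occurrences,
// yielding (keyword, count) pairs for each file
def keywordCount(lines: org.apache.spark.rdd.RDD[String]) =
  lines.flatMap(_.split(" "))
       .filter(_ == "Spark")
       .map(word => (word, 1))
       .reduceByKey(_ + _)

// step 3: join the two (K, V) RDDs on the keyword
val joined = keywordCount(changes).join(keywordCount(readme))

// step 4: prints ("Spark", (countInChanges, countInReadme))
joined.collect().foreach(println)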
 • 76. Tour of Spark API [Cluster diagram: Driver Program with SparkContext connects through a Cluster Manager to Worker Nodes, each running an Executor with a cache and tasks]
 • 77. The essentials of the Spark API in both Scala and Python… /_SparkCamp/05.scala_api" /_SparkCamp/05.python_api" ! Let’s start with the basic concepts, which are covered in much more detail in the docs: spark.apache.org/docs/latest/scala-programming-guide.html Spark Essentials: 77
  • 78. First thing that a Spark program does is create a SparkContext object, which tells Spark how to access a cluster In the shell for either Scala or Python, this is the sc variable, which is created automatically Other programs must use a constructor to instantiate a new SparkContext Then in turn SparkContext gets used to create other variables Spark Essentials: SparkContext 78
  • 79. sc" res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6ad51e9c Spark Essentials: SparkContext sc" Out[1]: <__main__.RemoteContext at 0x7ff0bfb18a10> Scala: Python: 79
 • 80. Spark Essentials: Master. The master parameter for a SparkContext determines which cluster to use: 
 • local: run Spark locally with one worker thread (no parallelism) 
 • local[K]: run Spark locally with K worker threads (ideally set to # cores) 
 • spark://HOST:PORT: connect to a Spark standalone cluster; PORT depends on config (7077 by default) 
 • mesos://HOST:PORT: connect to a Mesos cluster; PORT depends on config (5050 by default) 80
 • 81. [Cluster diagram: Driver Program with SparkContext connects through a Cluster Manager to Worker Nodes, each running an Executor with a cache and tasks] spark.apache.org/docs/latest/cluster-overview.html Spark Essentials: Master 81
 • 82. [Same cluster diagram as the previous slide] The driver performs the following: 1. connects to a cluster manager to allocate resources across applications 2. acquires executors on cluster nodes – processes that run compute tasks and cache data 3. sends app code to the executors 4. sends tasks for the executors to run Spark Essentials: Clusters 82
  • 83. Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on 
 in parallel There are currently two types: • parallelized collections – take an existing Scala collection and run functions on it in parallel • Hadoop datasets – run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop Spark Essentials: RDD 83
  • 84. • two types of operations on RDDs: 
 transformations and actions • transformations are lazy 
 (not computed immediately) • the transformed RDD gets recomputed 
 when an action is run on it (default) • however, an RDD can be persisted into 
 storage in memory or disk Spark Essentials: RDD 84
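A minimal sketch of that behavior, assuming sc is available: transformations only record lineage, and nothing runs until an action is called.

val nums  = sc.parallelize(1 to 100000)
val evens = nums.filter(_ % 2 == 0)   // transformation: lazy, nothing computed yet
evens.cache()                         // mark the RDD for reuse in memory
val total = evens.count()             // action: triggers the actual computation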
  • 85. val data = Array(1, 2, 3, 4, 5)" data: Array[Int] = Array(1, 2, 3, 4, 5)" ! val distData = sc.parallelize(data)" distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970] Spark Essentials: RDD data = [1, 2, 3, 4, 5]" data" Out[2]: [1, 2, 3, 4, 5]" ! distData = sc.parallelize(data)" distData" Out[3]: ParallelCollectionRDD[24864] at parallelize at PythonRDD.scala:364 Scala: Python: 85
 • 86. Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*) Spark Essentials: RDD [Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value] 86
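A minimal sketch of those file-based variants; the paths here are illustrative:

val oneFile   = sc.textFile("/data/201404/part-00000.txt")  // a single file
val wholeDir  = sc.textFile("/data/201404/")                // every file in a directory
val wholeGlob = sc.textFile("/data/201404*")                // a glob across matching paths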
  • 87. val distFile = sqlContext.table("readme")" distFile: org.apache.spark.sql.SchemaRDD = " SchemaRDD[24971] at RDD at SchemaRDD.scala:108 Spark Essentials: RDD distFile = sqlContext.table("readme").map(lambda x: x[0])" distFile" Out[11]: PythonRDD[24920] at RDD at PythonRDD.scala:43 Scala: Python: 87
  • 88. Transformations create a new dataset from 
 an existing one All transformations in Spark are lazy: they 
 do not compute their results right away – instead they remember the transformations applied to some base dataset • optimize the required calculations • recover from lost data partitions Spark Essentials: Transformations 88
 • 89. Spark Essentials: Transformations 
 • map(func): return a new distributed dataset formed by passing each element of the source through a function func 
 • filter(func): return a new dataset formed by selecting those elements of the source on which func returns true 
 • flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item) 
 • sample(withReplacement, fraction, seed): sample a fraction fraction of the data, with or without replacement, using a given random number generator seed 
 • union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument 
 • distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset 89
 • 90. Spark Essentials: Transformations 
 • groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs 
 • reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function 
 • sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument 
 • join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key 
 • cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith 
 • cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements) 90
  • 91. val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])" distFile.map(l => l.split(" ")).collect()" distFile.flatMap(l => l.split(" ")).collect() Spark Essentials: Transformations distFile = sqlContext.table("readme").map(lambda x: x[0])" distFile.map(lambda x: x.split(' ')).collect()" distFile.flatMap(lambda x: x.split(' ')).collect() Scala: Python: distFile is a collection of lines 91
  • 92. Spark Essentials: Transformations Scala: Python: closures val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])" distFile.map(l => l.split(" ")).collect()" distFile.flatMap(l => l.split(" ")).collect() distFile = sqlContext.table("readme").map(lambda x: x[0])" distFile.map(lambda x: x.split(' ')).collect()" distFile.flatMap(lambda x: x.split(' ')).collect() 92
  • 93. Spark Essentials: Transformations Scala: Python: closures val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])" distFile.map(l => l.split(" ")).collect()" distFile.flatMap(l => l.split(" ")).collect() distFile = sqlContext.table("readme").map(lambda x: x[0])" distFile.map(lambda x: x.split(' ')).collect()" distFile.flatMap(lambda x: x.split(' ')).collect() 93 looking at the output, how would you 
 compare results for map() vs. flatMap() ?
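A minimal sketch of the difference, using a tiny in-memory collection instead of the readme table:

val lines = sc.parallelize(Seq("to be or", "not to be"))

lines.map(_.split(" ")).collect()
// => Array(Array(to, be, or), Array(not, to, be))   one array per input line

lines.flatMap(_.split(" ")).collect()
// => Array(to, be, or, not, to, be)                 flattened into one collection of words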
 • 94. Spark Essentials: Actions 
 • reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one), and should also be commutative and associative so that it can be computed correctly in parallel 
 • collect(): return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data 
 • count(): return the number of elements in the dataset 
 • first(): return the first element of the dataset – similar to take(1) 
 • take(n): return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements 
 • takeSample(withReplacement, fraction, seed): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed 94
 • 95. Spark Essentials: Actions 
 • saveAsTextFile(path): write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file 
 • saveAsSequenceFile(path): write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.) 
 • countByKey(): only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key 
 • foreach(func): run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems 95
 • 96. val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])" val words = distFile.flatMap(l => l.split(" ")).map(word => (word, 1))" words.reduceByKey(_ + _).collect.foreach(println) Spark Essentials: Actions from operator import add" f = sqlContext.table("readme").map(lambda x: x[0])" words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))" words.reduceByKey(add).collect() Scala: Python: 96
 • 97. Spark can persist (or cache) a dataset in memory across operations spark.apache.org/docs/latest/programming-guide.html#rdd-persistence Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster The cache is fault-tolerant: if any partition 
 of an RDD is lost, it will automatically be recomputed using the transformations that originally created it Spark Essentials: Persistence 97
 • 98. Spark Essentials: Persistence 
 • MEMORY_ONLY: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. 
 • MEMORY_AND_DISK: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. 
 • MEMORY_ONLY_SER: store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. 
 • MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. 
 • DISK_ONLY: store the RDD partitions only on disk. 
 • MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes. 
 • OFF_HEAP (experimental): store RDD in serialized format in Tachyon. 98
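A minimal sketch of choosing a level explicitly (cache(), used on the next slide, is shorthand for persist with MEMORY_ONLY); the path is illustrative and sc is assumed:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("/data/some_text.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk rather than recompute
words.count()                                // first action materializes the cache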
 • 99. val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])" val words = distFile.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()" words.reduceByKey(_ + _).collect.foreach(println) Spark Essentials: Persistence from operator import add" f = sqlContext.table("readme").map(lambda x: x[0])" w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()" w.reduceByKey(add).collect() Scala: Python: 99
 • 100. Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks – for example, to give every node a copy of 
 a large input dataset efficiently Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost Spark Essentials: Broadcast Variables 100
 • 101. val broadcastVar = sc.broadcast(Array(1, 2, 3))" broadcastVar.value! res10: Array[Int] = Array(1, 2, 3) Spark Essentials: Broadcast Variables broadcastVar = sc.broadcast(list(range(1, 4)))" broadcastVar.value! Out[15]: [1, 2, 3] Scala: Python: 101
  • 102. Accumulators are variables that can only be “added” to through an associative operation Used to implement counters and sums, efficiently in parallel Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend 
 for new types Only the driver program can read an accumulator’s value, not the tasks Spark Essentials: Accumulators 102
  • 103. val accum = sc.accumulator(0)" sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)" ! accum.value! res11: Int = 10 Spark Essentials: Accumulators accum = sc.accumulator(0)" rdd = sc.parallelize([1, 2, 3, 4])" def f(x):" global accum" accum += x" ! rdd.foreach(f)" ! accum.value! Out[16]: 10 Scala: Python: 103
  • 104. val accum = sc.accumulator(0)" sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)" ! accum.value! res11: Int = 10 Spark Essentials: Accumulators accum = sc.accumulator(0)" rdd = sc.parallelize([1, 2, 3, 4])" def f(x):" global accum" accum += x" ! rdd.foreach(f)" ! accum.value! Out[16]: 10 Scala: Python: driver-side 104
  • 105. For a deep-dive about broadcast variables and accumulator usage in Spark, see also: Advanced Spark Features
 Matei Zaharia, Jun 2012
 ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf Spark Essentials: Broadcast Variables and Accumulators 105
  • 106. val pair = (a, b)"  " pair._1 // => a" pair._2 // => b Spark Essentials: (K,V) pairs Scala: 106 Python: pair = (a, b)"  " pair[0] # => a" pair[1] # => b
 • 107. Spark Essentials: API Details For more details about the Scala API: spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package For more details about the Python API: spark.apache.org/docs/latest/api/python/ 107
  • 109. Keynote: New Directions for Spark in 2015 Fri Feb 20 9:15am-9:25am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39547 As the Apache Spark userbase grows, the developer community is working 
 to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we want to make Spark accessible to a wider 
 set of users, through new high-level APIs for data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and external packages. Like all work on Spark, 
 these APIs are designed to plug seamlessly into Spark applications, giving users a unified platform for streaming, batch and interactive data processing. Matei Zaharia – started the Spark project at UC Berkeley, currently CTO of Databricks, Spark VP at Apache, and an assistant professor at MIT
  • 110. Spark Camp: Ask Us Anything Fri, Feb 20 2:20pm-3:00pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40701 Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join 
 in to hear what others are asking!
  • 111. Databricks Spark Talks @Strata + Hadoop World Thu Feb 19 10:40am-11:20am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237 Lessons from Running Large Scale Spark Workloads
 Reynold Xin, Matei Zaharia Thu Feb 19 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518 Spark Streaming - The State of the Union, and Beyond
 Tathagata Das
  • 112. Databricks Spark Talks @Strata + Hadoop World Fri Feb 20 11:30am-12:10pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237 Tuning and Debugging in Apache Spark
 Patrick Wendell Fri Feb 20 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38391 Everyday I’m Shuffling - Tips for Writing Better Spark Programs
 Vida Ha, Holden Karau
  • 113. Spark Developer Certification
 Fri Feb 20, 2015 10:40am-12:40pm • go.databricks.com/spark-certified-developer • defined by Spark experts @Databricks • assessed by O’Reilly Media • establishes the bar for Spark expertise
  • 114. • 40 multiple-choice questions, 90 minutes • mostly structured as choices among code blocks • expect some Python, Java, Scala, SQL • understand theory of operation • identify best practices • recognize code that is more parallel, less memory constrained ! Overall, you need to write Spark apps in practice Developer Certification: Overview 114
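As one illustration of recognizing “less memory constrained” code (not an actual exam question), a common comparison sketch:

// a tiny illustrative pair RDD
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value across the shuffle before summing,
// which can exhaust memory for heavily skewed keys
val totalsA = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side before the shuffle,
// so less data moves and per-key memory stays bounded
val totalsB = pairs.reduceByKey(_ + _)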
 • 119. blurs the lines between RDDs and relational tables spark.apache.org/docs/latest/sql-programming-guide.html ! intermix SQL commands to query external data, along with complex analytics, in a single app: • allows SQL extensions based on MLlib • provides the “heavy lifting” for ETL in DBC Spark SQL: Manipulating Structured Data Using Spark
 Michael Armbrust, Reynold Xin (2014-03-24)
 databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html 119 Spark SQL: Data Workflows
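A minimal sketch of that intermixing, assuming the “readme” table registered earlier in the tutorial and the Spark 1.2-era SchemaRDD API:

// run SQL, then keep working with the result as an RDD
val topLines = sqlContext.sql("SELECT * FROM readme LIMIT 10")
topLines.map(row => row(0).asInstanceOf[String].length).collect()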
  • 120. Parquet is a columnar format, supported 
 by many different Big Data frameworks http://parquet.io/ Spark SQL supports read/write of parquet files, 
 automatically preserving the schema of the 
 original data (HUGE benefits) Modifying the previous example… 120 Spark SQL: Data Workflows – Parquet
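A minimal sketch of that modification with the Spark 1.2-era SchemaRDD API; the output path is illustrative:

val readmeTable = sqlContext.table("readme")
readmeTable.saveAsParquetFile("/tmp/readme.parquet")        // schema is preserved

val reloaded = sqlContext.parquetFile("/tmp/readme.parquet")
reloaded.registerTempTable("readme_parquet")
sqlContext.sql("SELECT COUNT(*) FROM readme_parquet").collect()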
  • 121. 121 Demo of /_SparkCamp/demo_sql_scala 
 by the instructor: Spark SQL: SQL Demo
  • 122. 122 Next, we’ll review the following sections in the
 Databricks Guide: /databricks_guide/Importing Data" /databricks_guide/Databricks File System" ! Key Topics: • JSON, CSV, Parquet • S3, Hive, Redshift • DBFS, dbutils Spark SQL: Using DBFS
  • 124. 124 For any SQL query, you can show the results 
 as a table, or generate a plot with a 
 single click… Visualization: Built-in Plots
  • 125. 125 Several of the plot types have additional options 
 to customize the graphs they generate… Visualization: Plot Options
  • 126. 126 For example, series groupings can be used to help organize bar charts… Visualization: Series Groupings
  • 127. 127 See /databricks-guide/05 Visualizations 
 for details about built-in visualizations and extensions… Visualization: Reference Guide
  • 128. 128 The display() command: • programmatic access to visualizations • pass a SchemaRDD to print as an HTML table • pass a Scala list to print as an HTML table • call without arguments to display matplotlib figures Visualization: Using display()
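A minimal sketch of programmatic use, assuming a Databricks notebook with the “readme” table available:

// pass a SchemaRDD to render the query result as an HTML table
display(sqlContext.sql("SELECT * FROM readme LIMIT 10"))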
  • 129. 129 The displayHTML() command: • render any arbitrary HTML/JavaScript • include JavaScript libraries (advanced feature) • paste in D3 examples to get a sense for this… Visualization: Using displayHTML()
  • 130. 130 Clone the entire folder /_SparkCamp/Viz D3
 into your folder and run its notebooks: Demo: D3Visualization
  • 131. 131 Clone and run /_SparkCamp/07.sql_visualization
 in your folder: Coding Exercise: SQL +Visualization
  • 134. Let’s consider the top-level requirements for 
 a streaming framework: • clusters scalable to 100s of nodes • low-latency, in the range of seconds
 (meets 90% of use case needs) • efficient recovery from failures
 (which is a hard problem in CS) • integrates with batch: many companies run the 
 same business logic both online and offline Spark Streaming: Requirements 134
  • 135. Therefore, run a streaming computation as: 
 a series of very small, deterministic batch jobs ! • Chop up the live stream into 
 batches of X seconds • Spark treats each batch of 
 data as RDDs and processes 
 them using RDD operations • Finally, the processed results 
 of the RDD operations are 
 returned in batches Spark Streaming: Requirements 135
  • 136. Therefore, run a streaming computation as: 
 a series of very small, deterministic batch jobs ! • Batch sizes as low as ½ sec, 
 latency of about 1 sec • Potential for combining 
 batch processing and 
 streaming processing in 
 the same system Spark Streaming: Requirements 136
  • 137. Data can be ingested from many sources: 
 Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc. Results can be pushed out to filesystems, databases, live dashboards, etc. Spark’s built-in machine learning algorithms and graph processing algorithms can be applied to data streams Spark Streaming: Integration 137
 • 138. 2012 project started 2013 alpha release (Spark 0.7) 2014 graduated (Spark 0.9) Spark Streaming: Timeline Discretized Streams: A Fault-Tolerant Model 
 for Scalable Stream Processing Matei Zaharia, Tathagata Das, Haoyuan Li, 
 Timothy Hunter, Scott Shenker, Ion Stoica Berkeley EECS (2012-12-14) www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf project lead: 
 Tathagata Das @tathadas 138
  • 139. Typical kinds of applications: • datacenter operations • web app funnel metrics • ad optimization • anti-fraud • telecom • video analytics • various telematics and much much more! Spark Streaming: Use Cases 139
  • 140. Programming Guide
 spark.apache.org/docs/latest/streaming-programming-guide.html TD @ Spark Summit 2014
 youtu.be/o-NXwFrNAWQ?list=PLTPXxbhUt-YWGNTaDj6HSjnHMxiTD1HCR “Deep Dive into Spark Streaming”
 slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617 Spark Reference Applications
 databricks.gitbooks.io/databricks-spark-reference-applications/ Spark Streaming: Some Excellent Resources 140
  • 141. import org.apache.spark.streaming._" import org.apache.spark.streaming.StreamingContext._" ! // create a StreamingContext with a SparkConf configuration" val ssc = new StreamingContext(sparkConf, Seconds(10))" ! // create a DStream that will connect to serverIP:serverPort" val lines = ssc.socketTextStream(serverIP, serverPort)" ! // split each line into words" val words = lines.flatMap(_.split(" "))" ! // count each word in each batch" val pairs = words.map(word => (word, 1))" val wordCounts = pairs.reduceByKey(_ + _)" ! // print a few of the counts to the console" wordCounts.print()" ! ssc.start()" ssc.awaitTermination() Quiz: name the bits and pieces… 141
  • 142. import sys" from pyspark import SparkContext" from pyspark.streaming import StreamingContext" ! sc = SparkContext(appName="PyStreamNWC", master="local[*]")" ssc = StreamingContext(sc, Seconds(5))" ! lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))" ! counts = lines.flatMap(lambda line: line.split(" ")) " .map(lambda word: (word, 1)) " .reduceByKey(lambda a, b: a+b)" ! counts.pprint()" ! ssc.start()" ssc.awaitTermination() Demo: PySpark Streaming NetworkWord Count 142
  • 143. import sys" from pyspark import SparkContext" from pyspark.streaming import StreamingContext" ! def updateFunc (new_values, last_sum):" return sum(new_values) + (last_sum or 0)" ! sc = SparkContext(appName="PyStreamNWC", master="local[*]")" ssc = StreamingContext(sc, Seconds(5))" ssc.checkpoint("checkpoint")" ! lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))" ! counts = lines.flatMap(lambda line: line.split(" ")) " .map(lambda word: (word, 1)) " .updateStateByKey(updateFunc) " .transform(lambda x: x.sortByKey())" ! counts.pprint()" ! ssc.start()" ssc.awaitTermination() Demo: PySpark Streaming NetworkWord Count - Stateful 143
  • 146. 146 Streaming: collect tweets Twitter API HDFS: dataset Spark SQL: ETL, queries MLlib: train classifier Spark: featurize HDFS: model Streaming: score tweets language filter Demo: Twitter Streaming Language Classifier
 • 147. 147 1. extract text from the tweet https://twitter.com/andy_bf/status/16222269370011648 "Ceci n'est pas un tweet" 2. sequence text as bigrams tweet.sliding(2).toSeq ("Ce", "ec", "ci", …, ) 3. convert bigrams into numbers seq.map(_.hashCode()) (2178, 3230, 3174, …, ) 4. index into sparse tf vector! seq.map(_.hashCode() % 1000) (178, 230, 174, …, ) 5. increment feature count Vector.sparse(1000, …) (1000, [102, 104, …], [0.0455, 0.0455, …]) Demo: Twitter Streaming Language Classifier From tweets to ML features, approximated as sparse vectors:
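A sketch of those five steps as one Scala helper (an approximation of the demo’s featurization, not its exact code); numFeatures = 1000 matches the sparse vector size above, and math.abs is added to keep hashed indices non-negative:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

def featurize(text: String, numFeatures: Int = 1000): Vector = {
  val bigrams = text.sliding(2).toSeq                                  // 2. character bigrams
  val indices = bigrams.map(bg => math.abs(bg.hashCode) % numFeatures) // 3–4. hash, then index
  val tf = indices.groupBy(identity)
                  .mapValues(_.size.toDouble / bigrams.size)           // 5. normalized feature counts
  Vectors.sparse(numFeatures, tf.toSeq)                                // sparse tf vector
}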
 • 149. SBT is the Simple Build Tool for Scala: www.scala-sbt.org/ This is included with the Spark download, and 
 does not need to be installed separately. Similar to Maven; however, it provides for incremental compilation and an interactive shell, 
 among other innovations. The SBT project uses StackOverflow for Q&A, 
 which is a good resource to study further: stackoverflow.com/tags/sbt Spark in Production: Build: SBT 149
 • 150. Spark in Production: Build: SBT 
 • clean: delete all generated files (in the target directory) 
 • package: create a JAR file 
 • run: run the JAR (or main class, if named) 
 • compile: compile the main sources (in src/main/scala and src/main/java directories) 
 • test: compile and run all tests 
 • console: launch a Scala interpreter 
 • help: display detailed help for specified commands 150
  • 151. builds: • build/run a JAR using Java + Maven • SBT primer • build/run a JAR using Scala + SBT Spark in Production: Build: Scala 151
  • 152. The following sequence shows how to build 
 a JAR file from a Scala app, using SBT • First, this requires the “source” download, not the “binary” • Connect into the SPARK_HOME directory • Then run the following commands… Spark in Production: Build: Scala 152
  • 153. # Scala source + SBT build script on following slides! ! cd simple-app" ! ../sbt/sbt -Dsbt.ivy.home=../sbt/ivy package" ! ../spark/bin/spark-submit " --class "SimpleApp" " --master local[*] " target/scala-2.10/simple-project_2.10-1.0.jar Spark in Production: Build: Scala 153
  • 154. /*** SimpleApp.scala ***/" import org.apache.spark.SparkContext" import org.apache.spark.SparkContext._" ! object SimpleApp {" def main(args: Array[String]) {" val logFile = "README.md" // Should be some file on your system" val sc = new SparkContext("local", "Simple App", "SPARK_HOME"," List("target/scala-2.10/simple-project_2.10-1.0.jar"))" val logData = sc.textFile(logFile, 2).cache()" ! val numAs = logData.filter(line => line.contains("a")).count()" val numBs = logData.filter(line => line.contains("b")).count()" ! println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))" }" } Spark in Production: Build: Scala 154
  • 155. name := "Simple Project"" ! version := "1.0"" ! scalaVersion := "2.10.4"" ! libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"" ! resolvers += "Akka Repository" at "http://repo.akka.io/releases/" Spark in Production: Build: Scala 155
  • 157. Spark in Production: Databricks Cloud 157 Databricks Platform Databricks Workspace Arguably, one of the simplest ways to deploy Apache Spark is to use Databricks Cloud for cloud-based notebooks
 • 158. DataStax Enterprise 4.6: www.datastax.com/docs/latest-dse/ datastax.com/dev/blog/interactive-advanced-analytic-with-dse-and-spark-mllib • integration of Cassandra (elastic storage) and 
 Spark (elastic compute) • github.com/datastax/spark-cassandra-connector • exposes Cassandra tables as Spark RDDs • converts data types between Cassandra and Scala • allows for execution of arbitrary CQL statements • plays nice with Cassandra Virtual Nodes Spark in Production: DSE 158
  • 159. Apache Mesos, from which Apache Spark 
 originated… Running Spark on Mesos
 spark.apache.org/docs/latest/running-on-mesos.html Run Apache Spark on Apache Mesos
 tutorial based on Mesosphere + Google Cloud
 ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html Getting Started Running Apache Spark on Apache Mesos
 O’Reilly Media webcast
 oreilly.com/pub/e/2986 Spark in Production: Mesos 159
 • 160. Cloudera Manager 4.8.x: cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html • 5 steps to install the Spark parcel • 5 steps to configure and start the Spark service Also check out Cloudera Live: cloudera.com/content/cloudera/en/products-and-services/cloudera-live.html Spark in Production: CM 160
  • 161. MapR Technologies provides support for running 
 Spark on the MapR distros: mapr.com/products/apache-spark slideshare.net/MapRTechnologies/map-r-databricks-webinar-4x3 Spark in Production: MapR 161
  • 162. Hortonworks provides support for running 
 Spark on HDP: spark.apache.org/docs/latest/hadoop-third-party-distributions.html hortonworks.com/blog/announcing-hdp-2-1-tech-preview-component-apache-spark/ Spark in Production: HDP 162
 • 163. Running Spark on Amazon AWS EC2: blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster Spark in Production: EC2 163
  • 164. Spark in MapReduce (SIMR) – quick way 
 for Hadoop MR1 users to deploy Spark: databricks.github.io/simr/ spark-summit.org/talk/reddy-simr-let-your-spark-jobs-simmer-inside-hadoop-clusters/ • Spark runs on Hadoop clusters without 
 any install or required admin rights • SIMR launches a Hadoop job that only 
 contains mappers, includes Scala+Spark ./simr jar_file main_class parameters 
 [--outdir=] [--slots=N] [--unique] Spark in Production: SIMR 164
  • 165. review UI features spark.apache.org/docs/latest/monitoring.html http://<master>:8080/ http://<master>:50070/ • verify: is my job still running? • drill-down into workers and stages • examine stdout and stderr • discuss how to diagnose / troubleshoot Spark in Production: Monitor 165
  • 166. Spark in Production: Monitor – Spark Console 166
  • 167. Spark in Production: Monitor – AWS Console 167
  • 170. 170 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
 J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
 graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf Pregel: Large-scale graph computing at Google
 Grzegorz Czajkowski, et al.
 googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html GraphX: Unified Graph Analytics on Spark
 Ankur Dave, Databricks
 databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf Advanced Exercises: GraphX
 databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html GraphX: Further Reading…
  • 171. // http://spark.apache.org/docs/latest/graphx-programming-guide.html! ! import org.apache.spark.graphx._" import org.apache.spark.rdd.RDD" ! case class Peep(name: String, age: Int)" ! val nodeArray = Array(" (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31))," (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39))," (5L, Peep("Leslie", 45))" )" val edgeArray = Array(" Edge(2L, 1L, 7), Edge(2L, 4L, 2)," Edge(3L, 2L, 4), Edge(3L, 5L, 3)," Edge(4L, 1L, 1), Edge(5L, 3L, 9)" )" ! val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)" val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)" val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)" ! val results = g.triplets.filter(t => t.attr > 7)" ! for (triplet <- results.collect) {" println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")" } 171 GraphX: Example – simple traversals
 • 172. 172 GraphX: Example – routing problems [Graph diagram: four nodes (0–3) joined by edges with costs 1–4] What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Dijkstra
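A sketch of answering that question with GraphX’s Pregel operator (a Bellman-Ford-style relaxation rather than Dijkstra’s priority-queue algorithm). The figure’s exact edges aren’t recoverable from the slide text, so the endpoints and costs below are illustrative assumptions; reversing the edges lets a single-source run from node 0 yield the cost to reach node 0:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// an illustrative graph with Int edge attributes as costs (assumed layout)
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Array((0L, "node 0"), (1L, "node 1"), (2L, "node 2"), (3L, "node 3")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Array(Edge(1L, 0L, 1), Edge(2L, 1L, 3), Edge(2L, 0L, 4), Edge(3L, 2L, 1), Edge(3L, 0L, 2)))
val graph = Graph(vertices, edges)

// shortest paths *to* node 0: reverse the edges, then run Pregel from node 0
val sourceId: VertexId = 0L
val initialGraph = graph.reverse.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the cheaper cost
  triplet =>                                        // send a message where a cheaper route exists
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                          // merge messages: keep the minimum
)

sssp.vertices.collect.foreach(println)   // (vertexId, cost to reach node 0)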
  • 173. 173 Clone and run /_SparkCamp/08.graphx
 in your folder: GraphX: Coding Exercise
 • 175. Case Studies: Apache Spark, DBC, etc. Additional details about production deployments for Apache Spark can be found at: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark https://databricks.com/blog/category/company/partners http://go.databricks.com/customer-case-studies 175
 • 176. Case Studies: Automatic Labs 176 Spark Plugs Into Your Car
 Rob Ferguson
 spark-summit.org/east/2015/talk/spark-plugs-into-your-car finance.yahoo.com/news/automatic-labs-turns-databricks-cloud-140000785.html Automatic creates personalized driving habit dashboards • wanted to use Spark while minimizing investment in DevOps • provides data access to non-technical analysts via SQL • replaced Redshift and disparate ML tools with single platform • leveraged built-in visualization capabilities in notebooks to generate dashboards easily and quickly • used MLlib on Spark for needed functionality out of the box
 • 177. Spark at Twitter: Evaluation & Lessons Learnt
 Sriram Krishnan
 slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter • Spark can be more interactive, efficient than MR • support for iterative algorithms and caching • more generic than traditional MapReduce • Why is Spark faster than Hadoop MapReduce? • fewer I/O synchronization barriers • less expensive shuffle • the more complex the DAG, the greater the 
 performance improvement 177 Case Studies: Twitter
 • 178. Pearson uses Spark Streaming for next generation adaptive learning platform Dibyendu Bhattacharya databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html 178 • Kafka + Spark + Cassandra + Blur, on AWS on a YARN cluster • single platform/common API was a key reason to replace Storm with Spark Streaming • custom Kafka Consumer for Spark Streaming, using Low Level Kafka Consumer APIs • handles: Kafka node failures, receiver failures, leader changes, committed offset in ZK, tunable data rate throughput Case Studies: Pearson
 • 179. Unlocking Your Hadoop Data with Apache Spark and CDH5 Denny Lee slideshare.net/Concur/unlocking-your-hadoop-data-with-apache-spark-and-cdh5 179 • leading provider of spend management solutions and services • delivers recommendations based on business users’ travel and expenses – “to help deliver the perfect trip” • use of traditional BI tools with Spark SQL allowed analysts to make sense of the data without becoming programmers • needed the ability to transition quickly between Machine Learning (MLlib), Graph (GraphX), and SQL usage • needed to deliver recommendations in real-time Case Studies: Concur
 • 180. Stratio Streaming: a new approach to Spark Streaming David Morales, Oscar Mendez spark-summit.org/2014/talk/stratio-streaming-a-new-approach-to-spark-streaming 180 • Stratio Streaming is the union of a real-time messaging bus with a complex event processing engine atop Spark Streaming • allows the creation of streams and queries on the fly • paired with Siddhi CEP engine and Apache Kafka • added global features to the engine such as auditing and statistics Case Studies: Stratio
  • 181. Collaborative Filtering with Spark
 Chris Johnson
 slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark • collab filter (ALS) for music recommendation • Hadoop suffers from I/O overhead • show a progression of code rewrites, converting a 
 Hadoop-based app into efficient use of Spark 181 Case Studies: Spotify
  • 182. Guavus Embeds Apache Spark 
 into its Operational Intelligence Platform 
 Deployed at the World’s Largest Telcos Eric Carr databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html 182 • 4 of 5 top mobile network operators, 3 of 5 top Internet backbone providers, 80% MSOs in NorAm • analyzing 50% of US mobile data traffic, +2.5 PB/day • latency is critical for resolving operational issues before they cascade: 2.5 MM transactions per second • “analyze first” not “store first ask questions later” Case Studies: Guavus
  • 183. Case Studies: Radius Intelligence 183 From Hadoop to Spark in 4 months, Lessons Learned
 Alexis Roos
 http://youtu.be/o3-lokUFqvA • building a full SMB index took 12+ hours using 
 Hadoop and Cascading • pipeline was difficult to modify/enhance • Spark increased pipeline performance 10x • interactive shell and notebooks enabled data scientists 
 to experiment and develop code faster • PMs and business development staff can use SQL to 
 query large data sets
  • 185. community: spark.apache.org/community.html events worldwide: goo.gl/2YqJZK ! video+preso archives: spark-summit.org resources: databricks.com/spark-training-resources workshops: databricks.com/spark-training
  • 186. Further Resources: Spark Packages 186 Looking for other libraries and features? There are a variety of third-party packages available at: http://spark-packages.org/
  • 187. Further Resources: DBC Feedback 187 Other feedback, suggestions, etc.? http://feedback.databricks.com/
  • 188. MOOCs: Anthony Joseph
 UC Berkeley begins Apr 2015 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181 Ameet Talwalkar
 UCLA begins Q2 2015 edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
  • 189. confs: Spark Summit East
 NYC, Mar 18-19
 spark-summit.org/east QCon SP
 São Paulo, Brazil, Mar 23-27
 qconsp.com Big Data Tech Con
 Boston, Apr 26-28
 bigdatatechcon.com Strata EU
 London, May 5-7
 strataconf.com/big-data-conference-uk-2015 GOTO Chicago
 Chicago, May 11-14
 gotocon.com/chicago-2015 Spark Summit 2015
 SF, Jun 15-17
 spark-summit.org
  • 191. books: Fast Data Processing 
 with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/product/9781782167068.do Spark in Action
 Chris Fregly
 Manning (2015)
 sparkinaction.com/ Learning Spark
 Holden Karau, 
 Andy Konwinski, Matei Zaharia
 O’Reilly (2015)
 shop.oreilly.com/product/0636920028512.do
  • 192. About Databricks • Founded by the creators of Spark in 2013 • Largest organization contributing to Spark • End-to-end hosted service, Databricks Cloud • http://databricks.com/