SlideShare a Scribd company logo
1 of 78
Jump Start with
Apache® Spark™ 2.0
on Databricks
Jules S. Damji
Spark Community Evangelist
Fremont Big Data & Cloud Meetup
@2twitme
$ whoami
• Spark Community Evangelist @ Databricks
• Previously Developer Advocate @ Hortonworks
• In the past engineering roles at:
• Sun Microsystems, Netscape, @Home,VeriSign,
Scalix, Centrify, LoudCloud/Opsware, ProQuest
• jules@databricks.com
• https://www.linkedin.com/in/dmatrix
Agenda for the next 2+ hours
• Get to know Databricks
• Overview Spark Architecture
• What’s New in Spark 2.0
• Unified APIs: SparkSessions,
SQL, DataFrames, Datasets…
• Workshop Notebook 1
• Break…
Hour 1
• Introduction to DataFrames,
DataSets and Spark SQL
• Workshop Notebook 2
• Introduction to Structured
Streaming Concepts
• Workshop 3/Demo
• Go Home…
Hour 2+
Get to know Databricks
• Get Databricks community edition http://databricks.com/try-
databricks
We are Databricks, the company behind Apache Spark
Founded by the creators
of Apache Spark in 2013
Share of Spark code
contributed by
Databricks
in 2014
75%
5
Data Value
Created Databricks on top of Spark to make big data simple.
Unified engine across diverse workloads & environments
Apache Spark Architecture
Deployments
Modes
• Local
• Standalone
• YARN
• Mesos
Apache Spark Architecture
An Anatomy of an Application
Spark Application
• Jobs
• Stages
• Tasks
How did we Get Here..?
Where we Going..?
A Brief History
10
2012
Started
@
UC Berkeley
2010
research
paper
2013
Databricks
started
& donated
to ASF
2014
Spark 1.0 & libraries
(SQL, ML, GraphX)
2015
DataFrames/Datasets
Tungsten
ML Pipelines
2016
Apache Spark 2.0
Easier
Smart
er
Faster
Apache Spark 2.0
• Steps to Bigger & Better Things….
Builds on all we learned in past 2 years
Major Features in Apache Spark 2.0
Tungsten Phase 2
speedups of 5-10x
& Catalyst Optimizer
Faster
Structured Streaming
real-time engine
on SQL / DataFrames
Smarter
Unifying Datasets
and DataFrames &
SparkSessions
Easier
Unified API Foundation for the
Future: Spark Sessions, Dataset,
DataFrame, MLlib, Structured
Streaming…
Towards SQL 2003
• Today, Spark can run all 99 TPC-DS queries!
- New standard compliant parser (with good error
messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
- https://databricks.com/blog/2016/07/26/introducing-
apache-spark-2-0.html
- https://databricks.com/blog/2016/06/17/sql-subqueries-in-
apache-spark-2-0.html
0
100
200
300
400
500
600
Runtime(seconds) Preliminary TPC-DS Spark 2.0 vs 1.6 – Lower is Better
Time (1.6)
Time (2.0)
SparkSession – A Unified entry point to
Spark
• SparkSession is the “SparkContext” for
Dataset/DataFrame
- Entry point for reading data and writing data
- Working with metadata
- Setting Spark Configuration
- Driver uses for Cluster resource management
SparkSession vs SparkContext
SparkSessions Subsumes
• SparkContext
• SQLContext
• HiveContext
• StreamingContext
• SparkConf
Datasets and DataFrames
Long Term
• RDD will remain the low-level API in Spark
• For control and certain type-safety in Java/Scala
• Datasets & DataFrames give richer semantics
and optimizations
• For semi-structured data and DSL like operations
• New libraries will increasingly use these as
interchange format
• Examples: Structured Streaming, MLlib,
GraphFrames
– A Tale of Three APIs: RDDs, DataFrames and Datasets
Other notable API improvements
• DataFrame-based ML pipeline API becoming the main
MLlib API
• ML model & pipeline persistence with almost complete
coverage
• In all programming languages: Scala, Java, Python, R
• Improved R support
• (Parallelizable) User-defined functions in R
• Generalized Linear Models (GLMs), Naïve Bayes, Survival
Regression, K-Means
Workshop: Notebook on SparkSession
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh1
– http://docs.databricks.com
– http://spark.apache.org/docs/latest/api/scala/index.html#or
g.apache.spark.sql.SparkSession
• Familiarize your self with Databricks Notebook environment
• Work through each cell
• CNTR + <return> / Shift + Return
• Try challenges
• Break…
DataFrames/Datasets & Spark
SQL & Catalyst Optimizer
The not so secret truth…
SQL
is not about SQL
is about more than
SQL
Spark SQL: The whole story
1
0
Is About Creating and Running Spark
Programs Faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
Spark SQL Architecture
Logical
Plan
Physica
l Plan
Catalog
Optimizer
RDDs
…
Data
Source
API
SQL
DataFrame
s
Code
Generator
Datasets
28
Using Catalyst in Spark SQL
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical
Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
Catalog
Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: Physical planning
Code Generation: Compile parts of the query to Java
bytecode
SQL AST
DataFrame
Datasets
Catalyst Optimizations
Logical Optimizations
Create Physical Plan &
generate JVM bytecode
• Push filter predicates down to data source,
so irrelevant data can be skipped
• Parquet: skip entire blocks, turn
comparisons on strings into cheaper
integer comparisons via dictionary
encoding
• RDBMS: reduce amount of data traffic by
pushing predicates down
• Catalyst compiles operations into physical
plans for execution and generates JVM
bytecode
• Intelligently choose between broadcast
joins and shuffle joins to reduce network
traffic
• Lower level optimizations: eliminate
expensive object allocations and reduce
virtual function calls
# Load partitioned Hive table
def add_demographics(events):
u = sqlCtx.table("users")
events 
. j o i n ( u , events.user_id == u.user_id) 
.withColumn("city", zipToCity(df.zip))
# Join on user_id
# Run udf to add c i t y column
Physical Plan
with Predicate
Pushdown and
Column Pruning
join
optimize
d scan
(events)
optimize
d scan
(users)
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()
Logical
Plan
filter
join
Physical
Plan
join
scan
(users
)
events
file
users
table
30
scan
(event
s)
filter
Columns: Predicate pushdown
spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "people")
.load()
.where($"name" === "michael")
31
You Write
Spark Translates
For Postgres SELECT * FROM people WHERE name = 'michael'
4
3
Spark Core (RDD)
Catalyst
DataFrame/DatasetSQL
ML
Pipelines
Structure
d
Streamin
g
{ JSON }
JDBC
and
more…
Foundational Spark 2.0 Components
Spark
SQL
GraphFrame
s
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Dataset Spark 2.0 APIs
Background: What is in an RDD?
•Dependencies
• Partitions (with optional locality info)
• Compute function: Partition =>
Iterator[T]
Opaque
Computation
& Opaque Data
Structured APIs In Spark
36
SQL DataFrames Datasets
Syntax
Errors
Analysis
Errors
Runtime Compile
Time
Runtime
Compile
Time
Compile
Time
Runtime
Analysis errors are reported before a distributed job starts
Dataset API in Spark 2.0
• Typed interface over DataFrames / Tungsten
• case class Person(name: String, age: Int)
• val dataframe = spark.read.json(“people.json”)
• val ds: Dataset[Person] = dataframe.as[Person]
• ds.filter(p => p.name.startsWith(“M”))
.groupBy(“name”)
.avg(“age”)
Type-safe:
operate on
domain objects
with compiled
lambda functions
8
Datasets API
val df = spark.read.json("people.json")
/ / Convert data to domain objects.
case class Person(name: S tri ng , age: I n t )
val ds: Dataset[Person] = df.as[Person]
d s . f i l t e r ( _ . a g e > 30)
/ / Compute histogram of age by name.
val h i s t = ds.groupBy(_.name).mapGroups {
case (name, people: Iter[Person]) =>
val buckets = newArray[Int](10)
people.map(_.age).foreach { a =>
buckets(a / 10) += 1
}
(name, buckets)
}
Source: michaelmalak
Project Tungsten II
Project Tungsten
• Substantially speed up execution by optimizing
CPU efficiency, via: SPARK-12795
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
6 “bricks
”
Tungsten’s Compact Row Format
0x
0
12
3
32
L
48
L
4 “data”
(123, “data”,
“bricks”)
Null
bitmap
Offset to
data
Offset to
data
Field
lengths
2
Encoders
6
“bricks
”
0x0 123 32L 48L 4 “data”
JVM Object
Internal Representation
MyClass(123, “data”, “bricks”)
Encoders translate between
domain objects and Spark's
internal format
Datasets: Lightning-fast Serialization with Encoders
Performance of Core Primitives
cost per row (single thread)
primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8 bit entropy) 620 ns 5.3 ns
sort (64 bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X
10.11
Workshop: Notebook on
DataFrames/Datasets & Spark SQL
• Import Notebook into your Spark 2.0 Cluster
– http://dbricks.co/sswksh2
– https://spark.apache.org/docs/latest/api/scala/index.ht
ml#org.apache.spark.sql.Dataset
– Work through each cell
• Try challenges
• Break..
Introduction to Structured
Streaming
Streaming in Apache Spark
Streaming demands new types of streaming
requirements…
3
SQL Streamin
g
MLlib
Spark Core
GraphX
Functional, concise and
expressive
Fault-tolerant state
management
Unified stack with batch
processing
More than 51% users say most important part of Apache Spark
Spark Streaming in production jumped to 22% from 14%
Streaming apps are
growing more complex
4
Streaming computations
don’t run in isolation
• Need to interact with batch data,
interactive analysis, machine learning,
etc.
Use case: IoT Device Monitoring
IoT
events
from
Kafka ETL into long term
storage
- Prevent data loss
- Prevent duplicates
Status monitoring
- Handle late data
- Aggregate on
windows on even-t
time
Interactivel
y debug
issues
- consistency
event stream
Anomaly detection
- Learn modelsoffline
- Use online +
continuous learning
Use case: IoT Device Monitoring
Anomaly detection
- Learn modelsoffline
- Use online +
continuous learning
IoT events event stream
from Kafka
ETL into long term storage
- Prevent data loss
Status monitoring - Prevent duplicates Interactively
- Handle late data debug issues
- Aggregate on windows - consistency
on event time
Continuous Applications
Not just streaming any
more
The simplest way to perform streaming analytics
is not having to reason about streaming at all
Static,
bounded
table
Stream as a unbound DataFrame
Streaming,
unbounded table
Single API !
Stream as unbounded DataFrame
Gist of Structured Streaming
High-level streaming API built on Spark SQL engine
Runs the same computation as batch queries in Datasets /
DataFrames
Event time, windowing, sessions, sources & sinks
Guarantees an end-to-end exactly once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models to your Stream
Advantages over DStreams
1. Processing with event-time, dealing with late data
2. Exactly same API for batch, streaming, and
interactive
3. End-to-end exactly-once guarantees from the
system
4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, Codegen, etc.
- Faster state management for stateful stream processing
59
Structured Streaming ModelTrigger: every 1 sec
1 2 3
Time
data up
to 1
Input data up
to 2
data up
to 3
Quer
y
Input: data from source as an
append-only table
Trigger: how frequently to check
input for new data
Query: operations on input
usual map/filter/reduce
new window, session ops
Model Trigger: every 1 sec
1 2 3
output
for data
up to 1
Result
Quer
y
Time
data up
to 1
Input data up
to 2
output
for data
up to 2
data up
to 3
output
for data
up to 3
Result: final operated table
updated every trigger interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table every
time
Output
complet
e
output
Model Trigger: every 1 sec
1 2 3
output
for data
up to 1
Result
Quer
y
Time
data up
to 1
Input data up
to 2
output
for data
up to 2
data up
to 3
output
for data
up to 3
Output
delta
output
Result: final operated table
updated every trigger
interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table
every time
Delta output: Write only the rows that
changed
in result from previous batch
Example WordCount
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Batch ETL with DataFrame
inputDF = spark.read
.format("json")
.load("source-path")
resultDF = input
.select("device", "signal")
.where("signal > 15")
resultDF.write
.format("parquet")
.save("dest-path")
Read from JSON file
Select some devices
Write to parquet file
Streaming ETL with DataFrame
input = ctxt.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
read…stream() creates a streaming
DataFrame, does not start any of the
computation
write…startStream() defines where
& how to output the data and starts
the processing
Streaming ETL with DataFrame
input = spark.read
.format("json")
.stream("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
1 2 3
Result
[append-only table]
Input
Output
[append mode]
new
rows in
result of
2
new
rows in
result of
3
Continuous Aggregations
Continuously compute average
signal of each type of device
67
input.groupBy("device-type")
.avg("signal")
input.groupBy(
window("event-time",
"10min"),
"device type")
.avg("signal")
Continuously compute average
signal of each type of device in last
10 minutes of event time
- Windowing is just a type of aggregation
- Simple API for event time based
windowing
Joining streams with static data
kafkaDataset = spark.read
.kafka("iot-updates")
.stream()
staticDataset = ctxt .read
. j d b c ( " j d b c : / / " , " i o t - d evi c e - i nfo ")
joinedDataset =
kafkaDataset.join(
staticDataset, "device-type")
2
Join streaming data from Kafka
with static data via JDBC to
enrich the streaming data …
… without having to think that
you are joining streaming data
Output Modes
Defines what is outputted every time there is a
trigger Different output modes make sense for
different queries
2
input.select("device", " s i g n a l " )
. w r i t e
.outputMode("append")
.format("parquet")
.startStream("dest-path")
Append modewith
non-aggregation
queries
input.agg(count("*"))
. w r i t e
.outputMode("complete")
.format("parquet")
.startStream("dest-path")
Complete mode
with aggregation
queries
Query Management
query = result.write
.format("parquet")
.outputMode("append")
.startStream("dest-path")
query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()
70
query: a handle to the running streaming
computation for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same
time
Each query has unique name for keeping
track
Logically:
Dataset operations on table
(i.e. as easy to understand as
batch)
Physically:
Spark automatically runs the
query in streaming fashion
(i.e. incrementally and
continuously)
DataFrame
Logical Plan
Catalyst
optimizer Continuous,
incremental
execution
Query Execution
Batch/Streaming Execution on Spark SQL
72
DataFrame
/
Dataset
Logical
Plan
Planne
r
SQL AST
DataFram
e
Unresolve
d Logical
Plan
Logical
Plan
Optimized
Logical
Plan
RDDs
Selected
Physical
Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
CatalogDataset
Helluvalot of magic!
Continuous Incremental Execution
Planner knows how to convert
streaming logical plans to a
continuous series of incremental
execution plans
73
DataFrame
/
Dataset
Logical
Plan
Incremental
Execution Plan
1
Incremental
Execution Plan
2
Incremental
Execution Plan
3
Planne
r
Incremental
Execution Plan
4
Structured Streaming: Recap
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources
& sinks End-to-end exactly once
semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve
using JDBC Add, remove, change queries
at runtime
• Build and apply ML models
Demo & Workshop: Structured Streaming
• Import Notebook into your Spark 2.0 Cluster
• http://dbricks.co/sswksh3 (Demo)
• http://dbricks.co/sswksh4 (Workshop)
• Done!
Resources
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• https://spark-packages.org/
Q & A ?

More Related Content

What's hot

Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetAnkit Beohar
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementLaurent Leturgez
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Edureka!
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Giao trinh-oracle-tieng-viet[bookbooming.com]
Giao trinh-oracle-tieng-viet[bookbooming.com]Giao trinh-oracle-tieng-viet[bookbooming.com]
Giao trinh-oracle-tieng-viet[bookbooming.com]bookbooming1
 
하이브 최적화 방안
하이브 최적화 방안하이브 최적화 방안
하이브 최적화 방안Teddy Choi
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internalsnarsiman
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkTakuya UESHIN
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

What's hot (20)

Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data management
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Giao trinh-oracle-tieng-viet[bookbooming.com]
Giao trinh-oracle-tieng-viet[bookbooming.com]Giao trinh-oracle-tieng-viet[bookbooming.com]
Giao trinh-oracle-tieng-viet[bookbooming.com]
 
하이브 최적화 방안
하이브 최적화 방안하이브 최적화 방안
하이브 최적화 방안
 
PySaprk
PySaprkPySaprk
PySaprk
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 

Viewers also liked

Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure ExecutionDatabricks
 
Spark Summit Europe 2016 Keynote - Databricks CEO
Spark Summit Europe 2016 Keynote  - Databricks CEO Spark Summit Europe 2016 Keynote  - Databricks CEO
Spark Summit Europe 2016 Keynote - Databricks CEO Databricks
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Databricks
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkDatabricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDatabricks
 

Viewers also liked (20)

Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
Spark Summit Europe 2016 Keynote - Databricks CEO
Spark Summit Europe 2016 Keynote  - Databricks CEO Spark Summit Europe 2016 Keynote  - Databricks CEO
Spark Summit Europe 2016 Keynote - Databricks CEO
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 

Similar to Jump Start with Apache Spark 2.0 on Databricks

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connectorDenny Lee
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 

Similar to Jump Start with Apache Spark 2.0 on Databricks (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 

Recently uploaded (20)

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

Jump Start with Apache Spark 2.0 on Databricks

  • 1. Jump Start with Apache® Spark™ 2.0 on Databricks Jules S. Damji Spark Community Evangelist Fremont Big Data & Cloud Meetup @2twitme
  • 2. $ whoami • Spark Community Evangelist @ Databricks • Previously Developer Advocate @ Hortonworks • In the past engineering roles at: • Sun Microsystems, Netscape, @Home,VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest • jules@databricks.com • https://www.linkedin.com/in/dmatrix
  • 3. Agenda for the next 2+ hours • Get to know Databricks • Overview Spark Architecture • What’s New in Spark 2.0 • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook 1 • Break… Hour 1 • Introduction to DataFrames, DataSets and Spark SQL • Workshop Notebook 2 • Introduction to Structured Streaming Concepts • Workshop 3/Demo • Go Home… Hour 2+
  • 4. Get to know Databricks • Get Databricks community edition http://databricks.com/try- databricks
  • 5. We are Databricks, the company behind Apache Spark Founded by the creators of Apache Spark in 2013 Share of Spark code contributed by Databricks in 2014 75% 5 Data Value Created Databricks on top of Spark to make big data simple.
  • 6. Unified engine across diverse workloads & environments
  • 7. Apache Spark Architecture Deployments Modes • Local • Standalone • YARN • Mesos
  • 8. Apache Spark Architecture An Anatomy of an Application Spark Application • Jobs • Stages • Tasks
  • 9. How did we Get Here..? Where we Going..?
  • 10. A Brief History 10 2012 Started @ UC Berkeley 2010 research paper 2013 Databricks started & donated to ASF 2014 Spark 1.0 & libraries (SQL, ML, GraphX) 2015 DataFrames/Datasets Tungsten ML Pipelines 2016 Apache Spark 2.0 Easier Smart er Faster
  • 11. Apache Spark 2.0 • Steps to Bigger & Better Things…. Builds on all we learned in past 2 years
  • 12.
  • 13.
  • 14. Major Features in Apache Spark 2.0 Tungsten Phase 2 speedups of 5-10x & Catalyst Optimizer Faster Structured Streaming real-time engine on SQL / DataFrames Smarter Unifying Datasets and DataFrames & SparkSessions Easier
  • 15. Unified API Foundation for the Future: Spark Sessions, Dataset, DataFrame, MLlib, Structured Streaming…
  • 16. Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard compliant parser (with good error messages!) - Subqueries (correlated & uncorrelated) - Approximate aggregate stats - https://databricks.com/blog/2016/07/26/introducing- apache-spark-2-0.html - https://databricks.com/blog/2016/06/17/sql-subqueries-in- apache-spark-2-0.html
  • 17. 0 100 200 300 400 500 600 Runtime(seconds) Preliminary TPC-DS Spark 2.0 vs 1.6 – Lower is Better Time (1.6) Time (2.0)
  • 18. SparkSession – A Unified entry point to Spark • SparkSession is the “SparkContext” for Dataset/DataFrame - Entry point for reading data and writing data - Working with metadata - Setting Spark Configuration - Driver uses for Cluster resource management
  • 19. SparkSession vs SparkContext SparkSessions Subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
  • 21. Long Term • RDD will remain the low-level API in Spark • For control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics and optimizations • For semi-structured data and DSL like operations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming, MLlib, GraphFrames – A Tale of Three APIs: RDDs, DataFrames and Datasets
  • 22. Other notable API improvements • DataFrame-based ML pipeline API becoming the main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • Improved R support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
  • 23. Workshop: Notebook on SparkSession • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#or g.apache.spark.sql.SparkSession • Familiarize your self with Databricks Notebook environment • Work through each cell • CNTR + <return> / Shift + Return • Try challenges • Break…
  • 24. DataFrames/Datasets & Spark SQL & Catalyst Optimizer
  • 25. The not so secret truth… SQL is not about SQL is about more than SQL
  • 26. Spark SQL: The whole story 1 0 Is About Creating and Running Spark Programs Faster: • Write less code • Read less data • Let the optimizer do the hard work
  • 27. Spark SQL Architecture Logical Plan Physica l Plan Catalog Optimizer RDDs … Data Source API SQL DataFrame s Code Generator Datasets
  • 28. 28 Using Catalyst in Spark SQL Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog Analysis: analyzing a logical plan to resolve references Logical Optimization: logical plan optimization Physical Planning: Physical planning Code Generation: Compile parts of the query to Java bytecode SQL AST DataFrame Datasets
  • 29. Catalyst Optimizations Logical Optimizations Create Physical Plan & generate JVM bytecode • Push filter predicates down to data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding • RDBMS: reduce amount of data traffic by pushing predicates down • Catalyst compiles operations into physical plans for execution and generates JVM bytecode • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower level optimizations: eliminate expensive object allocations and reduce virtual function calls
  • 30. # Load partitioned Hive table def add_demographics(events): u = sqlCtx.table("users") events . j o i n ( u , events.user_id == u.user_id) .withColumn("city", zipToCity(df.zip)) # Join on user_id # Run udf to add c i t y column Physical Plan with Predicate Pushdown and Column Pruning join optimize d scan (events) optimize d scan (users) events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "New York").select(events.timestamp).collect() Logical Plan filter join Physical Plan join scan (users ) events file users table 30 scan (event s) filter
  • 31. Columns: Predicate pushdown spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "people") .load() .where($"name" === "michael") 31 You Write Spark Translates For Postgres SELECT * FROM people WHERE name = 'michael'
  • 32. 4 3 Spark Core (RDD) Catalyst DataFrame/DatasetSQL ML Pipelines Structure d Streamin g { JSON } JDBC and more… Foundational Spark 2.0 Components Spark SQL GraphFrame s
  • 35. Background: What is in an RDD? •Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T] Opaque Computation & Opaque Data
  • 36. Structured APIs In Spark 36 SQL DataFrames Datasets Syntax Errors Analysis Errors Runtime Compile Time Runtime Compile Time Compile Time Runtime Analysis errors are reported before a distributed job starts
  • 37. Dataset API in Spark 2.0 • Typed interface over DataFrames / Tungsten • case class Person(name: String, age: Int) • val dataframe = spark.read.json(“people.json”) • val ds: Dataset[Person] = dataframe.as[Person] • ds.filter(p => p.name.startsWith(“M”)) .groupBy(“name”) .avg(“age”)
  • 38. Type-safe: operate on domain objects with compiled lambda functions 8 Datasets API val df = spark.read.json("people.json") / / Convert data to domain objects. case class Person(name: S tri ng , age: I n t ) val ds: Dataset[Person] = df.as[Person] d s . f i l t e r ( _ . a g e > 30) / / Compute histogram of age by name. val h i s t = ds.groupBy(_.name).mapGroups { case (name, people: Iter[Person]) => val buckets = newArray[Int](10) people.map(_.age).foreach { a => buckets(a / 10) += 1 } (name, buckets) }
  • 41. Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
  • 42. 6 “bricks ” Tungsten’s Compact Row Format 0x 0 12 3 32 L 48 L 4 “data” (123, “data”, “bricks”) Null bitmap Offset to data Offset to data Field lengths 2
  • 43. Encoders 6 “bricks ” 0x0 123 32L 48L 4 “data” JVM Object Internal Representation MyClass(123, “data”, “bricks”) Encoders translate between domain objects and Spark's internal format
  • 45. Performance of Core Primitives cost per row (single thread) primitive Spark 1.6 Spark 2.0 filter 15 ns 1.1 ns sum w/o group 14 ns 0.9 ns sum w/ group 79 ns 10.7 ns hash join 115 ns 4.0 ns sort (8 bit entropy) 620 ns 5.3 ns sort (64 bit entropy) 620 ns 40 ns sort-merge join 750 ns 700 ns Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
  • 46.
  • 47. Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh2 – https://spark.apache.org/docs/latest/api/scala/index.ht ml#org.apache.spark.sql.Dataset – Work through each cell • Try challenges • Break..
  • 49. Streaming in Apache Spark Streaming demands new types of streaming requirements… 3 SQL Streamin g MLlib Spark Core GraphX Functional, concise and expressive Fault-tolerant state management Unified stack with batch processing More than 51% users say most important part of Apache Spark Spark Streaming in production jumped to 22% from 14%
  • 50. Streaming apps are growing more complex 4
  • 51. Streaming computations don’t run in isolation • Need to interact with batch data, interactive analysis, machine learning, etc.
  • 52. Use case: IoT Device Monitoring IoT events from Kafka ETL into long term storage - Prevent data loss - Prevent duplicates Status monitoring - Handle late data - Aggregate on windows on even-t time Interactivel y debug issues - consistency event stream Anomaly detection - Learn modelsoffline - Use online + continuous learning
  • 53. Use case: IoT Device Monitoring Anomaly detection - Learn modelsoffline - Use online + continuous learning IoT events event stream from Kafka ETL into long term storage - Prevent data loss Status monitoring - Prevent duplicates Interactively - Handle late data debug issues - Aggregate on windows - consistency on event time Continuous Applications Not just streaming any more
  • 54.
  • 55. The simplest way to perform streaming analytics is not having to reason about streaming at all
  • 56. Static, bounded table Stream as a unbound DataFrame Streaming, unbounded table Single API !
  • 57. Stream as unbounded DataFrame
  • 58. Gist of Structured Streaming High-level streaming API built on Spark SQL engine Runs the same computation as batch queries in Datasets / DataFrames Event time, windowing, sessions, sources & sinks Guarantees an end-to-end exactly once semantics Unifies streaming, interactive and batch queries Aggregate data in a stream, then serve using JDBC Add, remove, change queries at runtime Build and apply ML models to your Stream
  • 59. Advantages over DStreams 1. Processing with event-time, dealing with late data 2. Exactly same API for batch, streaming, and interactive 3. End-to-end exactly-once guarantees from the system 4. Performance through SQL optimizations - Logical plan optimizations, Tungsten, Codegen, etc. - Faster state management for stateful stream processing 59
  • 60. Structured Streaming ModelTrigger: every 1 sec 1 2 3 Time data up to 1 Input data up to 2 data up to 3 Quer y Input: data from source as an append-only table Trigger: how frequently to check input for new data Query: operations on input usual map/filter/reduce new window, session ops
  • 61. Model Trigger: every 1 sec 1 2 3 output for data up to 1 Result Quer y Time data up to 1 Input data up to 2 output for data up to 2 data up to 3 output for data up to 3 Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Output complet e output
  • 62. Model Trigger: every 1 sec 1 2 3 output for data up to 1 Result Quer y Time data up to 1 Input data up to 2 output for data up to 2 data up to 3 output for data up to 3 Output delta output Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Delta output: Write only the rows that changed in result from previous batch
  • 64. Batch ETL with DataFrame inputDF = spark.read .format("json") .load("source-path") resultDF = input .select("device", "signal") .where("signal > 15") resultDF.write .format("parquet") .save("dest-path") Read from JSON file Select some devices Write to parquet file
  • 65. Streaming ETL with DataFrame input = ctxt.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") read…stream() creates a streaming DataFrame, does not start any of the computation write…startStream() defines where & how to output the data and starts the processing
  • 66. Streaming ETL with DataFrame input = spark.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") 1 2 3 Result [append-only table] Input Output [append mode] new rows in result of 2 new rows in result of 3
  • 67. Continuous Aggregations Continuously compute average signal of each type of device 67 input.groupBy("device-type") .avg("signal") input.groupBy( window("event-time", "10min"), "device type") .avg("signal") Continuously compute average signal of each type of device in last 10 minutes of event time - Windowing is just a type of aggregation - Simple API for event time based windowing
  • 68. Joining streams with static data kafkaDataset = spark.read .kafka("iot-updates") .stream() staticDataset = ctxt .read . j d b c ( " j d b c : / / " , " i o t - d evi c e - i nfo ") joinedDataset = kafkaDataset.join( staticDataset, "device-type") 2 Join streaming data from Kafka with static data via JDBC to enrich the streaming data … … without having to think that you are joining streaming data
  • 69. Output Modes Defines what is outputted every time there is a trigger Different output modes make sense for different queries 2 input.select("device", " s i g n a l " ) . w r i t e .outputMode("append") .format("parquet") .startStream("dest-path") Append modewith non-aggregation queries input.agg(count("*")) . w r i t e .outputMode("complete") .format("parquet") .startStream("dest-path") Complete mode with aggregation queries
  • 70. Query Management query = result.write .format("parquet") .outputMode("append") .startStream("dest-path") query.stop() query.awaitTermination() query.exception() query.sourceStatuses() query.sinkStatus() 70 query: a handle to the running streaming computation for managing it - Stop it, wait for it to terminate - Get status - Get error, if terminated Multiple queries can be active at the same time Each query has unique name for keeping track
  • 71. Logically: Dataset operations on table (i.e. as easy to understand as batch) Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously) DataFrame Logical Plan Catalyst optimizer Continuous, incremental execution Query Execution
  • 72. Batch/Streaming Execution on Spark SQL 72 DataFrame / Dataset Logical Plan Planne r SQL AST DataFram e Unresolve d Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation CatalogDataset Helluvalot of magic!
  • 73. Continuous Incremental Execution Planner knows how to convert streaming logical plans to a continuous series of incremental execution plans 73 DataFrame / Dataset Logical Plan Incremental Execution Plan 1 Incremental Execution Plan 2 Incremental Execution Plan 3 Planne r Incremental Execution Plan 4
  • 74. Structured Streaming: Recap • High-level streaming API built on Datasets/DataFrames • Event time, windowing, sessions, sources & sinks End-to-end exactly once semantics • Unifies streaming, interactive and batch queries • Aggregate data in a stream, then serve using JDBC Add, remove, change queries at runtime • Build and apply ML models
  • 75.
  • 76. Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.0 Cluster • http://dbricks.co/sswksh3 (Demo) • http://dbricks.co/sswksh4 (Workshop) • Done!
  • 77. Resources • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • sparkhub.databricks.com • https://spark-packages.org/
  • 78. Q & A ?