In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
2. About Databricks
Founded by the creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
3. What is Apache Spark?
Unified engine across data workloads and platforms:
SQL | Streaming | ML | Graph | Batch | …
7. Spark 2.0
Steps to bigger & better things…
Builds on everything we learned in the past 2 years
8. Versioning in Spark
In reality, we hate breaking APIs!
Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs
Example: 1.6.0
- Major version: 1 (may change APIs)
- Minor version: 6 (adds APIs/features)
- Patch version: 0 (only bug fixes)
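The MAJOR.MINOR.PATCH scheme above can be made concrete with a tiny parser (a hypothetical helper, written here purely for illustration; it is not part of Spark):

```scala
// Hypothetical sketch: split a Spark version string into its
// semantic components, mirroring the 1.6.0 example above.
case class SparkVersion(major: Int, minor: Int, patch: Int)

def parseVersion(v: String): SparkVersion = {
  // "1.6.0" -> Array(1, 6, 0)
  val Array(major, minor, patch) = v.split("\\.").map(_.toInt)
  SparkVersion(major, minor, patch)
}
```

Under this scheme only a change in the first component may break APIs; the second adds APIs/features, and the third is bug fixes only.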
9. Major Features in 2.0
- Tungsten Phase 2: speedups of 5-20x
- Structured Streaming
- SQL 2003 & unifying Datasets and DataFrames
11. Towards SQL 2003
As of this week, Spark branch-2.0 can run all 99 TPC-DS queries!
- New standard-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
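For illustration, a correlated subquery of the kind the new parser accepts might look like this (a sketch, assuming a Spark 2.x SparkSession named `spark` with the TPC-DS `store_sales` table already registered):

```scala
// Sketch: a correlated subquery, newly supported in Spark 2.0.
// Assumes `spark` is a SparkSession and `store_sales` is a registered
// TPC-DS table; run e.g. inside spark-shell.
val aboveAvg = spark.sql("""
  SELECT ss_item_sk, ss_sales_price
  FROM store_sales s
  WHERE ss_sales_price > (SELECT AVG(ss_sales_price)
                          FROM store_sales
                          WHERE ss_item_sk = s.ss_item_sk)
""")
```

The inner query references `s.ss_item_sk` from the outer query, which is what makes it correlated.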
12. Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
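A minimal sketch of the merged API (assumes Spark 2.x on the classpath; the `Person` case class and sample data are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Illustrative type for the statically-typed Dataset API.
case class Person(name: String, age: Int)

val spark = SparkSession.builder.master("local[*]").appName("unified-api").getOrCreate()
import spark.implicits._

// A Dataset carries static types checked at compile time.
val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// In Spark 2.0, DataFrame is just an alias for Dataset[Row].
val df: DataFrame = ds.toDF()

// And static types can be recovered from a DataFrame.
val back: Dataset[Person] = df.as[Person]
```

Both views execute on the same Tungsten engine; the difference is only in the type information available at compile time.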
13. SparkSession – a new entry point
SparkSession is the “SparkContext” for Dataset/DataFrame
- Entry point for reading data
- Working with metadata
- Configuration
- Cluster resource management
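A sketch of the new entry point (assumes Spark 2.x; `people.json` is an illustrative path):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces the SQLContext/HiveContext split and wraps SparkContext.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("session-demo")
  .config("spark.sql.shuffle.partitions", "4")  // configuration
  .getOrCreate()

val people = spark.read.json("people.json")     // entry point for reading data
spark.catalog.listTables().show()               // working with metadata
```

The underlying SparkContext remains reachable as `spark.sparkContext` for code that still needs it.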
15. Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as the interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
16. Other notable API improvements
DataFrame-based ML pipeline API becoming the main MLlib API
ML model & pipeline persistence with almost complete coverage
• In all programming languages: Scala, Java, Python, R
Improved R support
• (Parallelizable) user-defined functions in R
• Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
18. Background
Real-time processing is vital for streaming analytics
Apps need a combination: batch & interactive queries
• Track state using a stream, then run SQL queries
• Train an ML model offline, then update it
20. Streaming is hard
- Processing: complex programming models; business logic change & new ops (windows, sessions)
- Output: how do we define output over time & correctness?
- Data: late arrival, varying distribution over time, …
21. The simplest way to perform streaming analytics
is not having to reason about streaming.
25. Structured Streaming
High-level streaming API built on the Spark SQL engine
• Declarative API that extends DataFrames / Datasets
• Event time, windowing, sessions, sources & sinks
Support interactive & batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
Not just streaming, but “continuous applications”
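A minimal Structured Streaming sketch against these APIs (assumes Spark 2.x; the socket source, host, and port are for demonstration only):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("stream-demo").getOrCreate()

// An unbounded DataFrame: read lines arriving on a socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Aggregate the stream with ordinary DataFrame operations.
val counts = lines.groupBy("value").count()

// Continuously emit the updated counts.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

The point of the design is that `groupBy(...).count()` is exactly the batch API; only the `readStream`/`writeStream` endpoints mark the query as continuous.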
26. Goal: end-to-end continuous applications
[Diagram: an example pipeline in which Kafka feeds an ETL job that writes to a database, whose results serve reporting applications, an ML model, and ad-hoc queries. Only part of this is traditional streaming; the rest are other processing types.]
29. Going back to the fundamentals
Difficult to get order-of-magnitude performance speedups with profiling techniques
• For a 10x improvement, would need to find top hotspots that add up to 90% and make them instantaneous
• For 100x, 99%
Instead, look bottom-up: how fast should it run?
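The arithmetic behind those percentages is Amdahl's law: if a fraction f of the runtime were made instantaneous, the overall speedup would be 1 / (1 - f). A one-line sketch:

```scala
// Amdahl's law, with the optimized fraction f of runtime made instantaneous.
// f = 0.90 gives roughly a 10x speedup; f = 0.99 gives roughly 100x.
def maxSpeedup(f: Double): Double = 1.0 / (1.0 - f)
```

So even optimizing hotspots covering 90% of the runtime to zero cost only yields 10x, which is why Tungsten instead asks how fast the query should run from first principles.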
31. Volcano Iterator Model
Standard for 30 years: almost all databases do it
Each operator is an “iterator” that consumes records from its input operator
// child is the upstream operator in the plan; predicate comes from the query
class Filter(child: Operator, predicate: InternalRow => Boolean) extends Operator {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}
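To make the model concrete, here is a self-contained, simplified version of the iterator pipeline above (rows reduced to plain Ints; the `Operator`/`Scan` names are illustrative, not Spark internals):

```scala
// Volcano model sketch: each operator pulls rows from its child
// through virtual next()/fetch() calls, one row at a time.
trait Operator {
  def next(): Boolean  // advance to the next row; false when exhausted
  def fetch(): Int     // return the current row
}

class Scan(rows: Array[Int]) extends Operator {
  private var i = -1
  def next(): Boolean = { i += 1; i < rows.length }
  def fetch(): Int = rows(i)
}

class Filter(child: Operator, predicate: Int => Boolean) extends Operator {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  def fetch(): Int = child.fetch()
}

// count(*) where value == 1000, driven entirely through the iterator interface
def volcanoCount(rows: Array[Int]): Long = {
  val plan = new Filter(new Scan(rows), _ == 1000)
  var count = 0L
  while (plan.next()) count += 1
  count
}
```

Every row crosses at least two virtual calls per operator, which is exactly the overhead the next slides contrast with a hand-written loop.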
32. What if we hire a college freshman to implement this query in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk <- store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
33. Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
35. How does a student beat 30 years of research?
Volcano model:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining
Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
36. Tungsten Phase 2: Spark as a “Compiler”
The whole plan (Scan → Filter → Project → Aggregate) collapses into a single generated loop:
var count = 0L
for (ss_item_sk <- store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
Functionality of a general-purpose execution engine; performance as if a hand-built system just to run your query
37. Performance of Core Primitives
Cost per row (single thread):

primitive               Spark 1.6   Spark 2.0
filter                  15 ns       1.1 ns
sum w/o group           14 ns       0.9 ns
sum w/ group            79 ns       10.7 ns
hash join               115 ns      4.0 ns
sort (8-bit entropy)    620 ns      5.3 ns
sort (64-bit entropy)   620 ns      40 ns
sort-merge join         750 ns      700 ns

Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
42. Today’s talk
Spark 2.0 doubles down on what made Spark attractive:
• Faster: Project Tungsten Phase 2, i.e. “Spark as a compiler”
• Easier: unified APIs & SQL 2003
• Smarter: Structured Streaming
• Only scratched the surface here, as Spark 2.0 will resolve ~2000 tickets.
Learn Spark on Databricks Community Edition
• join the beta waitlist: https://databricks.com/ce/