Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds! In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-of-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. There is no need to bridge multiple products or to manage and tune multiple clusters. We explain how one can take regular Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios where one has to ingest streams from many sources, cleanse the data, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data. Rather than stitching together multiple clusters as proposed in the Lambda architecture, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster: a design that is simpler, far more efficient, and lets you do everything from machine learning and data science to transactions and visual analytics in one single cluster.
6. Number of Processing Challenges
[Diagram: a stream processor joins incoming streams with a perpetually growing stream-sink table, time-windowed aggregated summaries that the app maintains, and reference tables sourced from external systems]
- JOIN across streams, a large table, and reference data
- Windows can be large (an hour) while sliding every 2 seconds
- Fast writes, including updates
- Exactly-once semantics
- HA
7. Number of Processing Challenges
[Diagram: the same pipeline, now also serving interactive queries]
- INTERACTIVE QUERIES (ad-hoc)
- High concurrency
- Point lookups as well as scans/aggregations
8. Why Is Supporting Mixed Workloads Difficult?

Data Structures | Query Processing Paradigm | Scheduling & Provisioning
----------------|---------------------------|--------------------------
Columnar        | Batch processing          | Long-running
Row stores      | Point lookups             | Short-lived
Sketches        | Delta / incremental       | Bursty
14. Our Solution – FUSE Spark with an In-Memory Database
[Diagram: a deep-scale, high-volume MPP DB fused with Spark]
- Real-time design: low latency, HA, concurrency
- Batch design: high throughput, rich API, ecosystem
- Matured over 13 years
A single unified HA cluster: OLTP + OLAP + Streaming for real-time analytics
15. We Transform Spark from a Computational Engine …
[Diagram: each user/app runs its own Spark master and executors (framework for streaming, SQL, ML, …) with an immutable cache; every cluster pulls from the same shared HDFS/SQL/NoSQL sources, creating a bottleneck]
• The cache cannot be updated
• The cache is repeated for each user/app
16. … Into an “Always-On” Hybrid Database!
[Diagram: a long-running Spark executor (worker) JVM hosts both the Spark runtime (stream processing, SQL, ML, …) and an in-memory ROW + COLUMN store with in-memory indexes – mutable and transactional – backed by shared-nothing persistence; the Spark driver submits Spark jobs to the cluster; clients also connect via JDBC/ODBC; history lives in HDFS/SQL/NoSQL and a deep-scale, high-volume MPP DB]
17. … Into an “Always-On” Hybrid Database!
[Architecture: the Spark API (Streaming, ML, Graph) and DataFrame/RDD/Dataset APIs sit over an in-memory layer combining rows, a columnar Spark cache, and synopses (samples); the SnappyData native store adds transactions, indexing, full SQL, and HA; a unified catalog and unified data access (virtual tables) span external sources – HDFS/HBase, S3, JSON/CSV/XML, SQL databases, Cassandra, MPP DBs, and stream sources]
Access paths: Spark jobs, Scala/Java/Python/R APIs, JDBC/ODBC, and the object API (RDD, Datasets).
18. Overview
[Architecture: a cluster manager & scheduler over Snappy data servers (Spark executor + store); each server combines a parser, OLAP and transaction (TRX) engines, a data synopsis engine, stream processing, and a query optimizer on top of a HYBRID store (probabilistic structures, rows, columns, indexes); a distributed membership service provides HA and elastic add/remove of servers; tables are reachable via DataFrames/RDDs and ODBC/JDBC along both low-latency and high-latency paths]
19. Unified Data Model & API
[Same architecture diagram as in the Overview, highlighting the data model and API layer]
• Mutability (DML + transactions)
• Indexing
• SQL-based streaming
20. Hybrid Store
[Diagram: unbounded streams are ingested into a row buffer (random writes, transactional state updates, reference data) that ages into column tables for OLAP; real-time sampling feeds sample tables and other probabilistic structures; indexes over row tables serve point access; stream analytics run across all of them]
Table types: row table, column table, sample table.
21. Simple API and Spark Compatible

// Use the Spark Data Source API
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
snappy.createTable("TableName", "Row | Column | Sample", schema, options)
someDataFrame.write.insertInto("TableName")

// Update, delete, put using the SnappySession
snappy.update("tableName", filterExpr, Row(<newColumnValues>), updatedColumns)
snappy.delete("tableName", filterExpr)

// Or just use Spark SQL syntax ...
snappy.sql("select x from tableName").count

// Or use JDBC/ODBC to access it like a regular database
jdbcStatement.executeUpdate("insert into tableName values ...")
25. Can We Use Statistical Techniques to Shrink Data?
• Most apps are happy to trade off 1% accuracy for a 200x speedup!
• You can usually get a 99.9% accurate answer by looking at only a tiny fraction of the data!
• You can often make perfectly accurate decisions with imperfect answers!
  • A/B testing, visualization, ...
• The data itself is usually noisy
  • Processing the entire dataset doesn’t necessarily mean exact answers!
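The tradeoff above can be sketched in a few lines of plain Python (synthetic data, not SnappyData code): scanning a 0.1% uniform sample yields an average that is typically well under 1% away from the full-scan answer, for roughly 1000x less work.

```python
import random

random.seed(7)

# Hypothetical workload: 1M noisy "bid" values; a full scan gives the
# exact average.
data = [random.gauss(20.0, 5.0) for _ in range(1_000_000)]
exact = sum(data) / len(data)

# Scan only a 0.1% uniform sample -- about 1000x less work.
sample = random.sample(data, 1_000)
approx = sum(sample) / len(sample)

rel_error = abs(approx - exact) / abs(exact)
print(f"exact={exact:.3f} approx={approx:.3f} rel_error={rel_error:.2%}")
```

With 1,000 samples the standard error of the mean is about 5/√1000 ≈ 0.16, i.e. under 1% of the true mean of 20 – which is why the slide's "99.9% accurate from a tiny fraction of data" claim is usually realistic.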
26. Probabilistic Store: Sketches + Uniform & Stratified Samples
1. Streaming CMS (Count-Min Sketch)
   - Higher resolution for more recent time ranges: windows of width 4T, 2T, T, ≤T over [t1, t2), [t2, t3), [t3, t4), [t4, now)
2. Top-k queries with arbitrary filters
   - Traditional CMS vs. CMS + samples: maintain a small sample at each CMS cell
3. Fully distributed stratified samples
   - For streams, always include the timestamp as a stratified column
   - Aging by timestamp: row store (in-memory) → column store (disk)
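For reference, here is a minimal Count-Min Sketch in Python (a generic textbook sketch, not SnappyData's implementation): every item increments one counter per hash row, and a point query takes the minimum across rows, which upper-bounds the error introduced by hash collisions.

```python
import random
from hashlib import blake2b

class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` hash rows of `width` counters.
    estimate() never under-counts; collisions can only inflate it."""

    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-ish hash per row, derived from a seeded digest.
        for seed in range(self.depth):
            h = blake2b(f"{seed}:{item}".encode(), digest_size=8)
            yield seed, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Minimum over rows bounds the collision error from above.
        return min(self.rows[row][col] for row, col in self._buckets(item))

random.seed(1)
cms = CountMinSketch()
stream = ["MI"] * 500 + ["CA"] * 200 + [f"other{i}" for i in range(3000)]
random.shuffle(stream)
for event in stream:
    cms.add(event)

print(cms.estimate("MI"))  # at least 500 (the true count), usually exact
```

A sketch like this answers frequency queries over unbounded streams in fixed memory; the slide's extension keeps a small sample per CMS cell so top-k queries can also apply arbitrary filters.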
27. Overview
[Architecture diagram repeated from the earlier Overview slide.]
28. Supporting Real-Time & HA
[Diagram: locators for discovery; a lead node hosts the managed Spark driver (Spark context, Snappy cluster manager, REST endpoint for Spark jobs and programs); each executor JVM (server) runs Spark block memory management alongside the Snappy store (stream tables, row/column tables, DataFrames) with shared-nothing persistence; clients reach the catalog service over JDBC/ODBC; servers form a peer-to-peer cluster]
• Spark executors are long-running; a driver failure doesn’t shut down the executors
• Driver HA – drivers are “managed” by SnappyData, with a standby secondary
• Data HA – consensus-based clustering integrated for eager replication
29. Overview
[Architecture diagram repeated from the earlier Overview slide.]
30. Transactions
• Support for Read Committed & Repeatable Read
• W-W and R-W conflict detection at write time
• MVCC for non-blocking reads and snapshot isolation
• Distributed failure detection integrated with the commit protocol
  - Evict unresponsive replicas
  - Ensure consistency when replicas recover
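The MVCC bullet above can be illustrated with a toy Python store (illustration only, not SnappyData's engine): readers pick the newest version visible to their snapshot without taking any lock, while writers detect write-write conflicts at write time against versions committed after their snapshot.

```python
import threading

class MVCCStore:
    """Toy multi-version key-value store: readers never block; writers
    detect W-W conflicts against versions newer than their snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clock = 0       # logical commit timestamp
        self._versions = {}   # key -> [(commit_ts, value), ...]

    def begin(self):
        with self._lock:
            return self._clock          # snapshot timestamp

    def read(self, snapshot_ts, key):
        # Non-blocking snapshot read: newest version committed at or
        # before the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def write(self, snapshot_ts, key, value):
        with self._lock:
            chain = self._versions.setdefault(key, [])
            # W-W conflict detection at write time.
            if chain and chain[-1][0] > snapshot_ts:
                raise RuntimeError("write-write conflict")
            self._clock += 1
            chain.append((self._clock, value))

store = MVCCStore()
store.write(store.begin(), "bal", 100)   # commits version 1

t1 = store.begin()                       # snapshot at version 1
t2 = store.begin()
store.write(t2, "bal", 150)              # commits version 2

old = store.read(t1, "bal")              # t1 still sees 100
new = store.read(store.begin(), "bal")   # a fresh snapshot sees 150
try:
    store.write(t1, "bal", 120)          # t1's snapshot predates version 2
    conflict = False
except RuntimeError:
    conflict = True
print(old, new, conflict)                # 100 150 True
```

A real engine adds R-W conflict checks, garbage collection of old versions, and a distributed commit protocol, but the non-blocking snapshot read is the same idea.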
31. Overview
[Architecture diagram repeated from the earlier Overview slide.]
32. Query Optimization
• Bypass the scheduler for transactions and low-latency jobs
• Minimize shuffles aggressively
  - Dynamic replication for reference data
  - Retain ‘join indexes’ whenever possible
  - Collocate related data sets
• ‘Hash Join’, ‘Scan’, and ‘GroupBy’ optimized relative to stock Spark
  - More variables in generated code; vectorized structures
• Column-segment pruning through statistics
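Column-segment pruning can be sketched in a few lines of Python (a hypothetical illustration, not SnappyData's internals): each column segment carries min/max statistics, so a filter can skip entire segments that cannot possibly contain matching rows.

```python
# Each column segment keeps min/max statistics so filters can skip
# segments that cannot match (hypothetical illustration).
SEGMENT_SIZE = 4

def build_segments(values):
    segments = []
    for i in range(0, len(values), SEGMENT_SIZE):
        chunk = values[i:i + SEGMENT_SIZE]
        segments.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return segments

def scan_greater_than(segments, threshold):
    scanned = 0
    hits = []
    for seg in segments:
        if seg["max"] <= threshold:
            continue  # pruned: no row in this segment can qualify
        scanned += 1
        hits.extend(v for v in seg["rows"] if v > threshold)
    return hits, scanned

segments = build_segments([1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23])
hits, scanned = scan_greater_than(segments, 15)
print(hits, scanned)  # [20, 21, 22, 23] 1 -- only 1 of 3 segments scanned
```

On sorted or time-ordered data the technique is especially effective, because qualifying rows cluster in few segments.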
34. Overview
[Architecture diagram repeated from the earlier Overview slide.]
35. Approximate Query Processing (AQP): Academia vs. Industry
25+ years of successful research in academia (AQUA, Online Aggregation, MapReduce Online, STRAT, ABS, BlinkDB / G-OLA, ...), yet user-facing AQP is almost non-existent in the commercial world – only some approximate features in Infobright, Yahoo’s Druid, Facebook’s Presto, Oracle 12c, ...
WHY?
1. Incompatible with BI tools
2. Complex semantics
3. Bad sales pitch!
An example of the unfamiliar semantics:

select geo, avg(bid)
from adImpressions
group by geo
having avg(bid) > 10
with error 0.05 at confidence 95

geo | avg(bid) | error | prob_existence
MI  | 21.5     | ±0.4  | 0.99
CA  | 18.3     | ±5.1  | 0.80
MA  | 15.6     | ±2.4  | 0.81
... | ...      | ...   | ...
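The "error" column in a result like the one above comes from standard confidence-interval math over the sample. A hedged Python sketch with synthetic data (not the SnappyData engine) for one group:

```python
import math
import random

# Synthetic stand-in for one "geo" group of adImpressions (hypothetical
# values; a real engine would sample a stored table instead).
random.seed(3)
bids = [random.gauss(21.5, 4.0) for _ in range(200_000)]

# Uniform sample; estimate avg(bid) and a 95% confidence interval via the
# normal approximation (z = 1.96) -- the source of the "± error" column.
sample = random.sample(bids, 2_000)
n = len(sample)
mean = sum(sample) / n
var = sum((b - mean) ** 2 for b in sample) / (n - 1)
half_width = 1.96 * math.sqrt(var / n)

print(f"avg(bid) = {mean:.1f} ± {half_width:.2f}")
```

With n = 2,000 the half-width is roughly 1.96·σ/√n ≈ 0.18 here; shrinking the error bound requires quadratically more samples, which is the core cost/accuracy dial of AQP.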
36. A First Industrial-Grade AQP Engine
1. High-level Accuracy Contract (HAC)
   • Concurrency: tens of queries in shared clusters
   • Resource usage: everyone hates their AWS bill
   • Network shuffles
   • Immediate results while waiting for the final answer
2. Fully compatible with BI tools
   • Set HAC behavior at the JDBC/ODBC connection level
3. Better marketing!
   • The user picks a single number p, where 0 ≤ p ≤ 1 (by default p = 0.95)
   • Snappy guarantees that they only see results that are at least p% accurate
   • Snappy handles (and hides) everything else!

geo | avg(bid)
MI  | 21.5
WI  | 42.3
NY  | 65.6
... | ...

iSight (Immediate inSight)
38. Unified OLAP/OLTP/Streaming with Spark
● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ a single cluster for stream ingestion, continuous queries, interactive queries, and machine learning
● Much faster
  ○ compressed, columnar data managed in distributed memory reduces volume and is much more responsive
39. Lessons Learned
1. A unique experience marrying two different breeds of distributed systems:
   lineage-based for high throughput vs. (consensus-based) replication for low latency
2. A unified cluster is simpler, cheaper, and faster
   - Sharing state across apps decouples apps from data servers and provides HA
   - Saves memory, data copying, serialization, and shuffles
   - Co-partitioning and co-location enable faster joins and stream analytics
3. Advantages over HTAP engines: deep stream integration + AQP
   - Stream processing ≠ stream analytics
   - Top-k with almost arbitrary predicates + one-pass stratified sampling over streams
4. Commercializing academic work is lots of work, but also lots of fun
40. THANK YOU !
Try our iSight cloud for free:
http://snappydata.io/iSight
42. Our Solution: High-level Accuracy Contract (HAC)
• A single number 0 ≤ p ≤ 1 (by default p = 0.95)
• We guarantee that you only see results that are at least p% accurate
• We handle (and hide) everything else
  – Choose a behavior: REPLACE WITH SPECIAL SYMBOL (default), DO NOTHING, or DROP THE ROW
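The three HAC behaviors above can be sketched as a post-filter over rows whose estimated relative error is known (a hypothetical illustration; field names and the error model are assumptions, not SnappyData's API):

```python
# Enforce a HAC threshold p over rows carrying their own error estimates.
P = 0.95  # rows must be at least 95% accurate, i.e. relative error <= 5%

rows = [
    {"geo": "MI", "avg_bid": 21.5, "rel_error": 0.02},
    {"geo": "CA", "avg_bid": 18.3, "rel_error": 0.28},
    {"geo": "MA", "avg_bid": 15.6, "rel_error": 0.15},
]

def apply_hac(rows, p, behavior="REPLACE_WITH_SPECIAL_SYMBOL"):
    out = []
    for row in rows:
        accurate_enough = row["rel_error"] <= (1 - p)
        if accurate_enough or behavior == "DO_NOTHING":
            out.append(row)
        elif behavior == "REPLACE_WITH_SPECIAL_SYMBOL":
            out.append({**row, "avg_bid": None})  # e.g. rendered as '?'
        elif behavior == "DROP_THE_ROW":
            pass  # silently omit the inaccurate row
    return out

replaced = apply_hac(rows, P)
dropped = apply_hac(rows, P, behavior="DROP_THE_ROW")
print([r["avg_bid"] for r in replaced])  # [21.5, None, None]
print(len(dropped))                       # 1
```

Hiding inaccurate values (rather than surfacing error columns) is what keeps the result shape compatible with unmodified BI tools.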
Editor's Notes
Many subscribers, lots of 2G/3G/4G voice/data
Network events: location events, CDRs, network issues
You have a stream that represents the current window, a large table that is perpetually growing, possibly continuous aggregations reducing the data (a-priori known summaries), and many smaller reference data tables. The reference data is sourced from other systems (say, a CRM) and is likely to be changing in real time as well.
Challenges:
- Stream analytics: Joins between streams, history, reference data for correlations, trends, pattern matching.
- Very fast writes, including updates. Potentially transactional semantics. Of course, you need exactly-once guarantees.
- Interactive queries: these tend to be ad-hoc in nature. Could be point lookups or aggregation/scan oriented queries. Only some can be satisfied using pre-fabricated aggregation tables. OLAP cube maintenance in streaming world is very difficult.
- Many users could be interacting. So, concurrency is important.
when mentioning sketches say “infinite streams”
Many purpose-built products; to derive meaningful insight, apps have to stitch several products together
According to the latest report, 75% of enterprises believe real time ranges from important to critical for business success
Why they need realtime: 52% said for delivering real-time dashboards and improving customer experience
Define real-time: 81% of enterprises define real time as either second or sub-second speeds
Snappy store and Spark Executor share the same process space and JVM memory
Reference based access
– zero copy
Contributions: sharing state across many clients and applications to minimize (de)serialization §5.4,
providing high-availability through low-latency failure detection and decoupling applications from data servers §6.1,
bypassing the scheduler to interleave fine-grained operations with long-running jobs §6.3,
ensuring transactional consistency §6.4.
share the same process space and memory pool between Spark and Geode §5.3,
co-partition tables and streams to minimize data shuffling §5.5.
For unified API say:
Spark’s DataFrame API allows for: ML, graph, batch & streaming, SQL (selects)
SnappyData adds full SQL support and extends DataFrame and DataSource APIs for:
Mutability semantics (DML & transactions)
Indexing
SQL-based streaming
No changes to the BI tool; no need to be a statistician; and there’s still an API to ask for all sorts of detailed error information if you’re an advanced user