SnappyData is a new open-source project that extends Apache Spark to enable real-time operational analytics by combining OLTP, OLAP, and stream processing capabilities. It provides a unified cluster that stores data in memory for fast queries and updates while maintaining high availability. SnappyData integrates Spark and GemFire technologies to offer low-latency queries, updates, and analytics on fast-moving streaming and batch data.
3. IoT is what makes the big data challenge very real
A 10 Trillion Device World [1]
www.snappydata.io
[1] http://cacm.acm.org/news/191847-get-ready-to-live-in-a-trillion-device-world/fulltext
4. Because insights are like people: useful for a short period of time
The New Arms Race
● Sift through data to get insights to improve your business
● What is your time to insights?
● What is your time to operationalizing insights?
5. Can we use the past to accurately predict the future?
The Holy Grail of Analytics
6. The faster you go, the bigger your business advantage
Speeding Up Insights
7. Exploding data volumes fuel the search for distributed solutions
How We Got Here
[Timeline: Teradata, Cognos, Greenplum, Netezza, ParAccel, Hadoop (SQL on Hadoop), Spark (Spark SQL)]
8. Every enterprise today deals with these 4 kinds of data interactions
The Four Horsemen Of Data
OLTP, OLAP, Streaming, Machine Learning
9. Who Are We?
● An EMC-Pivotal spinout focused on real-time operational analytics
● New Spark-based open source project started by Pivotal GemFire founders and engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
10. SnappyData At Cruising Altitude
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Batch design, high throughput
Real-time operational analytics: TBs in memory
First commercial project on Approximate Query Processing (AQP)
[Diagram labels: RDB; Rows; Txn; Columnar; Index; API; Stream processing; AQP; MPP DB; HDFS; ODBC, JDBC, REST; Spark - Scala, Java, Python, R]
11. SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Batch design, high throughput
Real-time design center: low latency, HA, concurrent
Vision: Drastically reduce the cost and complexity in modern big data
12. Huge community adoption, slip streaming into Hadoop momentum, great data integration platform
Why Spark?
• Most events in life can be analyzed as micro batches
• Blends streaming, interactive, and batch analytics
• Appeals to Java, R, Python, Scala programmers
• Rich set of transformations and libraries
• RDD and fault tolerance without replication
• Offers Spark SQL as a key capability
13. Spark is a compute framework that processes data, not an analytics database
Clearing Up Some Spark Myths
● It is NOT a distributed in-memory database
○ It’s a computational framework with immutable caching
● It is NOT Highly Available
○ Fault tolerance is not the same as HA
● NOT well suited for real time, operational environments
○ Does not handle concurrency well
○ Does not share data very well either
15. Perspective on Lambda for real time
[Diagram labels: Streams; Transform; Data-in-motion Analytics; Alerts; Application; In-Memory DB (interactive queries, updates); MPP DB (deep scale, high volume)]
24. Tables can be partitioned or replicated
Replicated table: consistent replica on each node
Partitioned table: data partitioned with one or more replicas
[Diagram labels: Replicated Table (shown three times); Partitioned Table (Buckets A-H); Partitioned Table (Buckets I-P); Partitioned Table (Buckets Q-W); Partition Replica (Buckets A-H); Partition Replica (Buckets I-P)]
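The placement idea on this slide can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions, not SnappyData's actual placement algorithm: the node names, the bucket count, the hash-modulo rule, and the "next node holds the replica" rule are all hypothetical.

```scala
object TablePlacementSketch {
  // Hypothetical cluster; node names and bucket count are illustrative only.
  val nodes: Vector[String] = Vector("node1", "node2", "node3")
  val numBuckets: Int = 12

  // Map a key to a bucket using a non-negative hash (assumed rule).
  def bucketOf(key: String): Int = Math.floorMod(key.hashCode, numBuckets)

  // Primary owner of a bucket followed by `redundancy` replica owners,
  // placed on the next nodes in ring order (assumed rule; needs redundancy < nodes.size).
  def ownersOf(bucket: Int, redundancy: Int): Vector[String] =
    (0 to redundancy).map(i => nodes((bucket + i) % nodes.size)).toVector

  // A replicated table keeps a full copy on every node.
  def replicatedOwners: Vector[String] = nodes

  def main(args: Array[String]): Unit = {
    val key = "subscriber-42"
    val bucket = bucketOf(key)
    println(s"key=$key bucket=$bucket owners=${ownersOf(bucket, redundancy = 1)}")
    println(s"replicated table owners=$replicatedOwners")
  }
}
```

With a redundancy of 1, every bucket of the partitioned table lives on two distinct nodes, while the replicated table is present everywhere.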
25. Linearly scale with shared partitions
Linearly scale with partition pruning
Input queue, stream, IMDB, and output queue all share the same partitioning strategy
[Diagram labels: Kafka queue; Spark Executor (x2); Subscriber A-M; Subscriber N-Z; Ref data]
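A rough way to see why a shared partitioning strategy enables pruning: if the queue and the table use the same function to place a subscriber, a lookup only has to scan one partition. The sketch below is plain Scala with hypothetical names; the A-M / N-Z split simply mirrors the labels on the slide and is not taken from any SnappyData API.

```scala
object SharedPartitioningSketch {
  // One partitioning function shared by the input queue, the stream and the table.
  // Assumed rule: subscribers starting with A-M go to partition 0, all others to 1.
  // Assumes a non-empty subscriber name.
  def partitionOf(subscriber: String): Int = {
    val first = subscriber.head.toUpper
    if (first >= 'A' && first <= 'M') 0 else 1
  }

  // Group records by partition, as a queue or a table would.
  def partition[A](records: Seq[A])(key: A => String): Map[Int, Seq[A]] =
    records.groupBy(r => partitionOf(key(r)))

  // Partition pruning: only the partition that can hold the subscriber is scanned.
  def lookup(table: Map[Int, Seq[(String, Int)]], subscriber: String): Seq[(String, Int)] =
    table.getOrElse(partitionOf(subscriber), Seq.empty).filter(_._1 == subscriber)

  def main(args: Array[String]): Unit = {
    val events = Seq(("alice", 1), ("zoe", 2), ("bob", 3), ("alice", 4))
    val table = partition(events)(_._1)
    println(lookup(table, "alice"))
  }
}
```

Because every component agrees on `partitionOf`, adding partitions (and nodes to host them) spreads both the stream and the stored data without cross-partition lookups.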
26. Point access, updates, fast writes
● Row tables with PKs are distributed HashMaps
○ with secondary indexes
● Support for transactional semantics
○ read_committed, repeatable_read
● Support for scalable high write rates
○ streaming data goes through stages
○ queue streams, intermediate storage (Delta row buffer), immutable compressed columns
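The staged write path described above (rows land in a mutable delta buffer and are later rolled into immutable column batches) can be approximated with a small in-memory model. This is a minimal sketch under assumptions: the class names, the fixed `batchSize` trigger, and the omission of compression are all illustrative, not SnappyData internals.

```scala
import scala.collection.mutable

object DeltaBufferSketch {
  final case class Row(id: Int, value: Double)

  // An immutable column batch: one vector per column (compression omitted).
  final case class ColumnBatch(ids: Vector[Int], values: Vector[Double])

  final class ColumnTable(batchSize: Int) {
    private val delta = mutable.ArrayBuffer.empty[Row] // mutable row buffer for fast writes
    private var batches = Vector.empty[ColumnBatch]    // immutable columnar storage

    def insert(row: Row): Unit = {
      delta += row
      if (delta.size >= batchSize) flush()
    }

    // Pivot buffered rows into columns and clear the buffer.
    private def flush(): Unit = {
      batches :+= ColumnBatch(delta.map(_.id).toVector, delta.map(_.value).toVector)
      delta.clear()
    }

    // Queries must read both the column batches and the not-yet-flushed rows.
    def sumValues: Double = batches.map(_.values.sum).sum + delta.map(_.value).sum
    def batchCount: Int = batches.size
    def bufferedRows: Int = delta.size
  }
}
```

Writes stay cheap because they only append to the row buffer; scans stay fast because most data sits in column form.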
27. Full Spark Compatibility
● Any table is also visible as a DataFrame
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Tables appear like any JDBC sourced table
○ But, in executor memory by default
● Additional API for updates, inserts, deletes
import org.apache.spark.sql.SaveMode
// Save a DataFrame using the Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)
// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")
28. Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',  // Default none
  PARTITION_BY 'PRIMARY KEY | column name',  // will be a replicated table, by default
  REDUNDANCY '1',  // Manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",  // Empty string will map to default disk store.
  OFFHEAP "true | false"
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …..
[AS select_statement];
29. Key feature: Synopses Data
● Maintain stratified samples
○ Intelligent sampling to keep error bounds low
● Probabilistic data
○ TopK for time series (using time aggregation CMS, item aggregation)
○ Histograms, HyperLogLog, Bloom Filters, Wavelets
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
  BASETABLE 'table_name'  // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+  // one or more QCS
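To make the idea of a stratified sample concrete, here is a minimal plain-Scala sketch. It assumes that the QCS columns define the strata and that the fraction applies to each stratum separately, with at least one row kept per stratum; the slide does not spell out these semantics, and the `Sale` type, `stratifiedSample` function, and seed parameter are hypothetical.

```scala
import scala.util.Random

object StratifiedSampleSketch {
  final case class Sale(region: String, amount: Double)

  // Group rows by the QCS key, then keep `fraction` of each stratum
  // (at least one row), so rare groups are never dropped entirely.
  def stratifiedSample[A, K](rows: Seq[A], fraction: Double, seed: Long)(qcs: A => K): Seq[A] = {
    val rng = new Random(seed)
    rows.groupBy(qcs).values.flatMap { stratum =>
      val n = math.max(1, math.ceil(stratum.size * fraction).toInt)
      rng.shuffle(stratum).take(n)
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    val sales = Seq.fill(100)(Sale("east", 10.0)) ++ Seq.fill(3)(Sale("west", 99.0))
    val sample = stratifiedSample(sales, fraction = 0.1, seed = 42L)(_.region)
    println(sample.groupBy(_.region).map { case (k, v) => k -> v.size })
  }
}
```

A uniform 10% sample of the same data could easily miss the three "west" rows; stratifying on the grouping column keeps every group represented, which is what helps keep error bounds low for group-by queries.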
32. SnappyData is Open Source
● Beta will be on GitHub in January. We are looking for contributors!
● Learn more & register for beta: www.snappydata.io
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
○ IRC: irc.freenode.net #snappydata