SnappyData
A Unified Cluster for Streaming, Transactions, & Interactive Analytics
© Snappydata Inc 2017
www.Snappydata.io
Jags Ramnarayan
CTO, CoFounder
Our Pedigree
GTD Ventures Team : ~ 30
- Founded GemFire (In-memory data grid)
- Pivotal spin-out
- 25+ VMware, Pivotal database engineers
Investors 
Mixed Workloads Are Everywhere
Stream Processing, Transaction, Interactive Analytics
• Analytics on mutating data
• Correlating and joining streams with large histories
• Maintaining state or counters while ingesting streams
Telco Use Case : Location based Services, network optimization
Revenue Generation
• Real-time location-based mobile advertising (B2B2C)
• Location-based services (B2C, B2B, B2B2C)
Revenue Protection
• Customer experience management to reduce churn
• Customer sentiment analysis
Network Efficiency
• Network bandwidth optimisation
• Network signalling maximisation
• Network optimization
– E.g. reroute a call to another cell tower if congestion is detected
• Location based Ads
– Match incoming event to subscriber profile; if 'Opt-in', show location-sensitive ad
• Challenge: Streaming analytics, interactive real-time dashboards
● Stream processor today: simple rules - (CallDroppedCount > threshold) then alert
● Need: stream analytics, i.e. complex (OLAP-like) queries
● TopK, trending, join with reference data, correlate with history
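The simple-rule case (CallDroppedCount > threshold) can be sketched in plain Scala without any streaming framework. The `CallEvent` type, the `DroppedCallAlert` class, and the window and threshold values are hypothetical names chosen for illustration, and the sketch assumes events arrive in timestamp order.

```scala
import scala.collection.mutable

// Hypothetical event type: one record per finished call.
final case class CallEvent(towerId: String, timestampMs: Long, dropped: Boolean)

// Keeps dropped-call timestamps per tower for the last `windowMs` milliseconds
// and reports an alert when the count exceeds `threshold`.
final class DroppedCallAlert(windowMs: Long, threshold: Int) {
  private val droppedAt = mutable.Map.empty[String, mutable.Queue[Long]]

  /** Returns true when the tower's dropped-call count in the window exceeds the threshold. */
  def observe(event: CallEvent): Boolean = {
    val queue = droppedAt.getOrElseUpdate(event.towerId, mutable.Queue.empty[Long])
    if (event.dropped) queue.enqueue(event.timestampMs)
    // Evict timestamps that have slid out of the window.
    while (queue.nonEmpty && queue.head <= event.timestampMs - windowMs) queue.dequeue()
    queue.size > threshold
  }
}
```

A rule like this needs only a small per-key counter. The OLAP-like queries listed above (TopK, joins with reference data, correlation with history) need access to much larger state, which is the gap the deck goes on to describe.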
Stream pipeline
[Diagram: sources (IoT devices and sensors, cell Call Data Records (CDR), ad impressions) feed a stream process. The process joins against reference tables (reference data from external systems), maintains time-window aggregated summaries, and writes to a perpetually growing stream sink table.]
Number of Processing Challenges
[Diagram: the stream pipeline, annotated with processing challenges.]
• JOIN: streams, large table, reference data
• Window could be large (an hour), sliding every 2 seconds
• Fast writes
• Updates
• Exactly-once semantics
• HA
Number of Processing Challenges
[Diagram: the stream pipeline (stream process, reference tables, time-window aggregated summary, stream sink), annotated with a further challenge.]
• INTERACTIVE QUERIES (ad-hoc)
- High concurrency
- Point lookup, scan/aggregations
Why Is Supporting Mixed Workloads Difficult?
Data Structures    Query Processing Paradigm    Scheduling & Provisioning
Columnar           Batch processing             Long-running
Row stores         Point lookups                Short-lived
Sketches           Delta / incremental          Bursty
Lambda Architecture
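[Diagram: the five parts of the Lambda Architecture.]
1. New data is sent to both the batch layer and the speed layer
2. Batch layer: holds the master dataset (HDFS, HBase)
3. Serving layer: batch views (SQL on Hadoop, MPP DB)
4. Speed layer: real-time views (Storm, Spark Streaming, Samza…)
5. Query: answered from the batch views and the real-time views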
Lambda Architecture is Complex
• Complexity
• Learn and master multiple products, data models, disparate APIs & configs
• Wasted resources
• Slower
• Excessive copying, serialization, shuffles
• Impossible to achieve interactive-speed analytics on large or mutating data
Spark Streaming with NoSQL for State
[Diagram: each app (APP1, APP2) runs its own Spark master and Spark executors (workers) with a framework for streaming, SQL, ML… and an immutable cache. The cache cannot be updated and is repeated for each user/app. The SQL and NoSQL stores behind the apps are the bottleneck.]
1. Pushdown filter to NoSQL partition
2. Serialize, de-serialize to Spark executor
3. Multiple copies of large data sets
4. Lose optimization - vectorization
Interactive, Continuous queries TOO SLOW
Can We Simplify & Optimize?
Our Solution: SnappyData
A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics
Our Solution – FUSE Spark with In-memory database
Spark: batch design, high throughput, rich API, ecosystem
In-memory database: real-time design, low latency, HA, concurrency; matured over 13 years
Deep scale, high volume: MPP DB
Single Unified HA Cluster: OLTP + OLAP + Streaming for Real-time Analytics
We Transform Spark from a computational engine …
[Diagram: each user/app (USER 1/APP1, USER 2/APP2) runs its own Spark master and Spark executors (workers) with a framework for streaming, SQL, ML… and an immutable cache. The cache cannot be updated and is repeated for each user/app. HDFS, SQL, and NoSQL stores behind the apps are the bottleneck.]
… Into an “Always-On” Hybrid Database !
[Diagram: a Spark cluster of long-running Spark executor (worker) JVMs. Each JVM hosts the Spark runtime (stream processing, SQL, ML…) next to an in-memory ROW + COLUMN store with in-memory indexes. The store is mutable and transactional, with shared-nothing persistence. A Spark driver, Spark jobs, and JDBC/ODBC clients connect to the cluster. History lives in HDFS, SQL, NoSQL, or a deep-scale, high-volume MPP DB.]
… Into an “Always-On” Hybrid Database !
[Diagram: the SnappyData stack.]
• Access: Spark jobs, Scala/Java/Python/R API, JDBC/ODBC, object API (RDD, DataSets)
• Spark API (Streaming, ML, Graph); DataFrame, RDD, DataSets
• Transactions, indexing, full SQL, HA
• In-memory: rows, columnar, Spark cache, synopses (samples)
• Unified data access (virtual tables), unified catalog, native store
• External sources: HDFS/HBase, S3, JSON, CSV, XML, SQL DB, Cassandra, MPP DB, stream sources
Overview
[Diagram: SnappyData server (Spark executor + store).]
• Access: tables, ODBC/JDBC, DataFrame, RDD, stream processing
• Query path: parser, query optimizer, OLAP, TRX, data synopsis engine
• HYBRID store: probabilistic, rows, columns, index
• Cluster manager & scheduler: low latency and high latency
• Distributed membership service (HA); add / remove server
Unified Data Model & API
[Diagram: the SnappyData server components (parser, query optimizer, OLAP, TRX, data synopsis engine, HYBRID store, cluster manager & scheduler, distributed membership service), reached through tables, ODBC/JDBC, DataFrame, RDD, and stream processing.]
• Mutability (DML + Trx)
• Indexing
• SQL-based streaming
Hybrid Store
[Diagram: unbounded stream ingestion, real-time sampling, and transactional state updates feed the hybrid store.]
• Row table: rows with index; random writes (reference data)
• Column table: row buffer in front of columns; OLAP
• Sample table: probabilistic; stream analytics
Simple API and Spark Compatible
// Use the Spark Data Source API
val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
snappy.createTable("TableName", "Row | Column | Sample", schema, options)
someDataFrame.write.insertInto("TableName")
// Update, Delete, Put using the SnappySession
snappy.update("tableName", filterExpr, Row(<newColumnValues>), updatedColumns)
snappy.delete("tableName", filterExpr)
// Or, just use Spark SQL syntax
snappy.sql("select x from tableName").count
// Or, use JDBC or ODBC to access it like a regular database
jdbcStatement.executeUpdate("insert into tableName values …")
Extends Spark SQL
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',
  PARTITION_BY 'PRIMARY KEY | column name', // Replicated table, by default
  REDUNDANCY '1', // Manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // Default: only in-memory
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …..
)
[AS select_statement];
Stream SQL DDL and Continuous queries (based on Spark Streaming)
Pipeline: consume from stream, transform raw data, continuous analytics, ingest into in-memory store, overflow table to HDFS.

// Consume from the stream and transform raw data
create stream table AdImpressionLog
  (<Columns>) using directkafka_stream options (
    <socket endpoints>,
    topics 'adnetwork-topic',
    rowConverter 'AdImpressionLogAvroDecoder')

// Register a continuous query (CQ)
streamingContext.registerCQ(
    "select publisher, geo, avg(bid) as avg_bid, count(*) imps, " +
    "count(distinct(cookie)) uniques from AdImpressionLog " +
    "window (duration '2' seconds, slide '2' seconds) " +
    "where geo != 'unknown' group by publisher, geo")
  // Ingest each result batch into the in-memory column table
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
Updates & Deletes on Column Tables
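[Diagram: one partition over time. A WRITE lands in an MVCC row buffer, which is replicated for HA and rolls over into a new column segment. Column segments cover time ranges (t1-t2, t2-t3); each holds key values (K11, K12, …), column values (C11, C12, …; C21, C22, …), a column of 0/1 flags, and summary metadata. Segments undergo periodic compaction.]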
Can we use Statistical techniques to shrink data?
• Most apps are happy to trade off 1% accuracy for a 200x speedup!
• Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data!
• Often can make perfectly accurate decisions with imperfect answers!
• A/B testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn't necessarily mean exact answers!
Probabilistic Store: Sketches + Uniform & Stratified Samples
1. Streaming CMS (Count-Min Sketch)
Higher resolution for more recent time ranges.
Time ranges [t1, t2), [t2, t3), [t3, t4), [t4, now) span 4T, 2T, T, ≤T.
2. Top-K Queries w/ Arbitrary Filters
Maintain a small sample at each CMS cell (traditional CMS vs. CMS + samples).
3. Fully Distributed Stratified Samples
Always include timestamp as a stratified column for streams.
Streams age from the row store (in-memory) to the column store (disk) by timestamp.
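For readers unfamiliar with the Count-Min Sketch, a minimal version can be written with only the Scala standard library. This is a generic textbook sketch, not SnappyData's implementation: the class name, the `depth` and `width` parameters, and the use of seeded MurmurHash3 as the per-row hash family are assumptions, and the time-decayed ranges and per-cell samples described on the slide are left out.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: `depth` hash rows of `width` counters each.
final class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // keep the index non-negative
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions only ever inflate a counter, so the minimum across rows
  // never underestimates the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Memory is fixed at `depth * width` counters regardless of how many distinct items the stream contains, which is what makes the structure attractive for unbounded streams.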
Supporting Real-time & HA
[Diagram labels: Locator; Lead Node (Managed Driver, Spark Context, Snappy Cluster Manager, Catalog Service, REST, Spark Jobs); Executor JVM (Server) with Memory Mgmt, Blocks, Snappy Store, Stream, Tables, and Shared Nothing Persistence; Spark Program; DataFrame; JDBC/ODBC; Peer-2-peer.]
• Spark Executors are long running. Driver failure doesn't shut down Executors
• Driver HA: Drivers are "Managed" by SnappyData with a standby secondary
• Data HA: Consensus-based clustering integrated for eager replication
Transactions
• Support for Read Committed & Repeatable Read
• W-W and R-W conflict detection at write time
• MVCC for non-blocking reads and snapshot isolation
• Distributed system failure detection integrated with commit protocol
- Evict unresponsive replicas
- Ensure consistency when replicas recover
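Write-write conflict detection at write time can be illustrated with a toy, single-node store. Everything here (`VersionedStore`, `Tx`, the version counter) is a hypothetical teaching aid, not SnappyData's distributed commit protocol; it shows only that a conflicting write fails at the moment it is attempted, not at commit.

```scala
import scala.collection.mutable

// Toy single-node store, only to illustrate W-W conflict detection at write time.
final class VersionedStore {
  private val committed = mutable.Map.empty[String, (Long, String)] // key -> (version, value)
  private val writeLocks = mutable.Map.empty[String, Long]          // key -> owning tx id
  private var clock = 0L
  private var nextId = 0L

  final class Tx(val id: Long, val snapshot: Long) {
    private val pending = mutable.Map.empty[String, String]

    // Readers never block: they see their own writes or the committed value.
    def read(key: String): Option[String] =
      pending.get(key).orElse(committed.get(key).map(_._2))

    // Fails immediately if another transaction holds the key, or if a newer
    // version was committed after this transaction began.
    def write(key: String, value: String): Boolean = {
      val lockedByOther = writeLocks.get(key).exists(_ != id)
      val changedSinceSnapshot = committed.get(key).exists(_._1 > snapshot)
      if (lockedByOther || changedSinceSnapshot) false
      else { writeLocks(key) = id; pending(key) = value; true }
    }

    def commit(): Unit = {
      clock += 1
      pending.foreach { case (k, v) => committed(k) = (clock, v) }
      pending.keys.foreach(writeLocks.remove)
      pending.clear()
    }
  }

  def begin(): Tx = { nextId += 1; new Tx(nextId, clock) }
}
```

Failing early means a transaction does no further work on a key it cannot win, at the cost of holding a per-key lock until commit.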
Query Optimization
• Bypass the scheduler for transactions and low-latency jobs
• Minimize shuffles aggressively
- Dynamic replication for reference data
- Retain ‘join indexes’ whenever possible
- Collocate related data sets
• Optimized ‘Hash Join’, ‘Scan’, ‘GroupBy’ compared to Spark
- Uses more variables in code generation, vectorized structures
• Column segment pruning through statistics
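Column segment pruning through statistics can be sketched as a min/max check per segment. The `Segment` type and the `SegmentPruning` helpers are hypothetical and cover a single numeric column; the slide does not say which statistics SnappyData keeps, so min and max are an assumption.

```scala
// Hypothetical per-segment statistics: min and max of one numeric column.
final case class Segment(id: Int, min: Double, max: Double, values: Vector[Double])

object SegmentPruning {
  def buildSegment(id: Int, values: Vector[Double]): Segment =
    Segment(id, values.min, values.max, values)

  // A segment can be skipped when [min, max] cannot overlap the predicate range [lo, hi].
  def candidates(segments: Seq[Segment], lo: Double, hi: Double): Seq[Segment] =
    segments.filter(s => s.max >= lo && s.min <= hi)

  // Scans only the surviving segments; returns (matching values, segments scanned).
  def scan(segments: Seq[Segment], lo: Double, hi: Double): (Seq[Double], Int) = {
    val kept = candidates(segments, lo, hi)
    (kept.flatMap(_.values.filter(v => v >= lo && v <= hi)), kept.size)
  }
}
```

For a range predicate between 11 and 21 over segments holding (1, 2, 3), (10, 12, 15), and (20, 25), only the last two segments are read; the first is rejected from its statistics alone.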
Co-partitioning & Co-location
[Diagram: two Kafka queues partitioned by subscriber (A-M, N-Z) feed two Spark executors. Each executor holds the matching subscriber partition (A-M or N-Z) together with reference data.]
Linearly scale with partition pruning
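The idea behind co-partitioning can be shown with a small sketch: if the stream and the table use the same partitioning function on the same key, a join never has to move data between partitions. The `Profile` and `Event` types, the hash-based `partitionOf` function, and the helper names are assumptions for illustration; the slide itself shows range partitions (A-M, N-Z).

```scala
object CoPartitioning {
  // The same partitioning function is used for the stream and the table.
  def partitionOf(subscriberId: String, numPartitions: Int): Int =
    Math.floorMod(subscriberId.hashCode, numPartitions)

  final case class Profile(subscriberId: String, optIn: Boolean)
  final case class Event(subscriberId: String, cellId: String)

  // Groups records by partition, as a partitioned table or queue would.
  def partition[A](records: Seq[A], key: A => String, n: Int): Map[Int, Seq[A]] =
    records.groupBy(r => partitionOf(key(r), n))

  // Joins each event partition only against the profile partition with the
  // same index, so no record moves between partitions.
  def localJoin(events: Map[Int, Seq[Event]],
                profiles: Map[Int, Seq[Profile]]): Seq[(Event, Profile)] =
    events.toSeq.flatMap { case (p, evs) =>
      val local = profiles.getOrElse(p, Seq.empty).map(pr => pr.subscriberId -> pr).toMap
      evs.flatMap(e => local.get(e.subscriberId).map(pr => (e, pr)))
    }
}
```

Because each partition is joined independently, adding partitions (and executors to host them) spreads the work without adding shuffle traffic.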
Approximate Query Processing (AQP): Academia vs. Industry
Academia: 25+ yrs of successful research (AQUA, Online Aggregation, MapReduce Online, STRAT, ABS, BlinkDB / G-OLA, ...)
BUT: User-facing AQP almost non-existent in the commercial world! Some approximate features in Infobright, Yahoo's Druid, Facebook's Presto, Oracle 12c, ...
WHY?
select geo, avg(bid)
from adImpressions
group by geo having avg(bid) > 10
with error 0.05 at confidence 95

geo    avg(bid)    error    prob_existence
MI     21.5        ± 0.4    0.99
CA     18.3        ± 5.1    0.80
MA     15.6        ± 2.4    0.81
...    ...         ...      ...

1. Incompatible w/ BI tools
2. Complex semantics
3. Bad sales pitch!
A First Industrial-Grade AQP Engine
1. High-level Accuracy Contract (HAC)
• User picks a single number p, where 0 ≤ p ≤ 1 (by default p = 0.95)
• Snappy guarantees that the user only sees things that are at least 100·p% accurate (95% by default)
• Snappy handles (and hides) everything else!
2. Fully compatible w/ BI tools
• Set HAC behavior at JDBC/ODBC connection level

geo    avg(bid)
MI     21.5
WI     42.3
NY     65.6
...    ...

3. Better marketing!
• iSight (Immediate inSight)
• Concurrency: 10's of queries in shared clusters
• Resource usage: everyone hates their AWS bill
• Network shuffles
• Immediate results while waiting for final results
Conclusion
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ Single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
○ Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
Lessons Learned
1. A unique experience marrying two different breeds of distributed systems
- Lineage-based for high-throughput vs. (consensus-) replication-based for low-latency
2. A unified cluster is simpler, cheaper, and faster
- By sharing state across apps, we decouple apps from data servers and provide HA
- Save memory, data copying, serialization, and shuffles
- Co-partitioning and co-location for faster joins and stream analytics
3. Advantages over HTAP engines: Deep stream integration + AQP
- Stream processing ≠ stream analytics
- Top-k w/ almost arbitrary predicates + 1-pass stratified sampling over streams
4. Commercializing academic work is lots of work but also lots of fun
THANK YOU !
Try our iSight cloud for free:
http://snappydata.io/iSight
iSight: Immediate inSight
iSight's immediate answer to the query: 1.7 secs
Final answer to the query: 42.7 secs
25x speedup!
Our Solution:
High-level Accuracy Contract (HAC)
• A single number 0 ≤ p ≤ 1 (by default p = 0.95)
• We guarantee that you only see things that are at least 100·p% accurate (95% by default)
• We handle (and hide) everything else
– Choose a behavior: REPLACE WITH SPECIAL SYMBOL (default), DO NOTHING, or DROP THE ROW
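The three HAC behaviors can be sketched as a small filter over approximate result rows. This is an illustration only: the `ApproxRow` type, its `accuracy` field, the `applyContract` function, and the "?" placeholder are all assumptions, since the slides do not show how SnappyData represents per-row accuracy or which special symbol it uses.

```scala
object AccuracyContract {
  sealed trait Behavior
  case object ReplaceWithSpecialSymbol extends Behavior // the default named on the slide
  case object DoNothing extends Behavior
  case object DropTheRow extends Behavior

  // Hypothetical approximate result row: an estimate plus an accuracy score in [0, 1].
  final case class ApproxRow(geo: String, avgBid: Double, accuracy: Double)

  // Renders rows for the client with no error columns, applying the chosen
  // behavior to rows whose accuracy falls below p.
  def applyContract(rows: Seq[ApproxRow], p: Double = 0.95,
                    behavior: Behavior = ReplaceWithSpecialSymbol): Seq[(String, String)] = {
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    rows.flatMap { row =>
      val ok = row.accuracy >= p
      behavior match {
        case _ if ok                  => Some(row.geo -> row.avgBid.toString)
        case DoNothing                => Some(row.geo -> row.avgBid.toString)
        case ReplaceWithSpecialSymbol => Some(row.geo -> "?") // placeholder symbol, an assumption
        case DropTheRow               => None
      }
    }
  }
}
```

The result has the same shape as an exact query result, which is the property that keeps it usable from BI tools that know nothing about error bounds.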

Spark Summit EU talk by Berni SchieferSpark Summit
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaData Con LA
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexThomas Weise
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Data & Analytics Forum: Moving Telcos to Real Time
Data & Analytics Forum: Moving Telcos to Real TimeData & Analytics Forum: Moving Telcos to Real Time
Data & Analytics Forum: Moving Telcos to Real TimeSingleStore
 

Similar to SnappyData, the Spark Database. A unified cluster for streaming, transactions & interactive analytics (20)

Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Efficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out Databases
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh Poduska
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache Apex
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Data & Analytics Forum: Moving Telcos to Real Time
Data & Analytics Forum: Moving Telcos to Real TimeData & Analytics Forum: Moving Telcos to Real Time
Data & Analytics Forum: Moving Telcos to Real Time
 
Spark cep
Spark cepSpark cep
Spark cep
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

SnappyData, the Spark Database. A unified cluster for streaming, transactions & interactive analytics

  • 1. SnappyData: A Unified Cluster for Streaming, Transactions, & Interactive Analytics. © Snappydata Inc 2017. www.Snappydata.io. Jags Ramnarayan, CTO, Co-Founder
  • 2. Our Pedigree 2 GTD Ventures Team : ~ 30 - Founded GemFire (In-memory data grid) - Pivotal Spin out - 25+ VMWare, Pivotal Database Engineers Investors 
  • 3. Mixed Workloads Are Everywhere: Stream Processing, Transaction, Interactive Analytics. Analytics on mutating data; correlating and joining streams with large histories; maintaining state or counters while ingesting streams
  • 4. Telco Use Case: Location-based services, network optimization. Revenue Generation: real-time location-based mobile advertising (B2B2C); location-based services (B2C, B2B, B2B2C). Revenue Protection: customer experience management to reduce churn; customer sentiment analysis. Network Efficiency: network bandwidth optimisation; network signalling maximisation. • Network optimization – e.g. re-route a call to another cell tower if congestion is detected • Location-based ads – match the incoming event to the subscriber profile; if 'opt-in', show a location-sensitive ad • Challenge: streaming analytics, interactive real-time dashboards ● Simple rules: if (CallDroppedCount > threshold) then alert ● Or complex (OLAP-like query) ● TopK, trending, join with reference data, correlate with history. Stream processor today. Need: Stream Analytics
  • 5. Stream pipeline: sources (IoT devices, cell; e.g. IoT sensors, Call Data Records (CDR), ad impressions); stream process over a time window; stream sink table (perpetually growing); aggregated summary (maintain summaries); reference tables (reference data from external systems)
  • 6. Number of Processing Challenges (pipeline: stream process, time window, aggregated summary, perpetually growing stream sink table, reference tables from external systems). JOIN – streams, large table, reference data - Window could be large (an hour), sliding every 2 seconds - Fast writes - Updates - Exactly-once semantics - HA
  • 7. Number of Processing Challenges (pipeline: stream process, time window, aggregated summary, stream sink, reference tables). INTERACTIVE QUERIES (ad-hoc) - High concurrency - Point lookups, scans/aggregations
  • 8. Why Is Supporting Mixed Workloads Difficult?
      Data Structures | Query Processing Paradigm | Scheduling & Provisioning
      Columnar        | Batch Processing          | Long-running
      Row stores      | Point Lookups             | Short-lived
      Sketches        | Delta / Incremental       | Bursty
  • 10. Lambda Architecture is Complex • Complexity: learn and master multiple products, data models, disparate APIs & configs • Wasted resources • Slower: excessive copying, serialization, shuffles • Impossible to achieve interactive-speed analytics on large or mutating data
  • 11. Spark Streaming with NoSQL for State. Per app (APP1, APP2): Spark Master, Spark Executor (Worker), framework for streaming, SQL, ML…, immutable CACHE • Cannot update • Repeated for each User/App. SQL / NoSQL store: bottleneck. 1. Pushdown filter to NoSQL partition 2. Serialize, de-serialize to Spark executor 3. Multiple copies of large data sets 4. Lose optimization - vectorization. Interactive, continuous queries: TOO SLOW
  • 14. Our Solution – FUSE Spark with In-memory Database. Deep Scale, High Volume MPP DB; Real-time design: low latency, HA, concurrency; Batch design: high throughput, rich API, ecosystem; Matured over 13 years; Single Unified HA Cluster: OLTP + OLAP + Streaming for Real-time Analytics
  • 15. We Transform Spark from a computational engine … Per user/app (USER 1/APP1, USER 2/APP2): Spark Master, Spark Executor (Worker), framework for streaming, SQL, ML…, immutable CACHE • Cannot update • Repeated for each User/App. HDFS / SQL / NoSQL: bottleneck
  • 16. … Into an "Always-On" Hybrid Database! Spark Cluster: Spark Driver; Spark Executor (Worker) JVM, long-running; Spark Runtime (stream process, SQL, ML…); In-Memory ROW + COLUMN Store with in-memory indexes (mutable, transactional); Shared Nothing Persistence. Access: JDBC, ODBC, Spark Job. HISTORY: Deep Scale, High Volume MPP DB, HDFS, SQL, NoSQL
  • 17. … Into an "Always-On" Hybrid Database! SNAPPYDATA: Spark API (Streaming, ML, Graph); Transactions, Indexing, Full SQL, HA; DataFrame, RDD, DataSets; IN-MEMORY: Rows, Columnar, Spark Cache, Synopses (Samples); Unified Data Access (Virtual Tables); Unified Catalog; Native Store. Data sources: HDFS/HBase, S3, JSON, CSV, XML, SQL db, Cassandra, MPP DB, stream sources. Access: Spark Jobs, Scala/Java/Python/R API, JDBC/ODBC, Object API (RDD, DataSets)
  • 18. Overview 18 Cluster Manager & Scheduler Snappy Data Server (SparkExecutor+ Store) Parser OLAP TRX Data Synopsis Engine Distributed Membership Service H A Stream Processing Data Frame RDD Low Latency High Latency HYBRID Store Probabilistic Rows Columns Index Query Optimizer Add / Remove Server Tables ODBC/JDBC
  • 19. Unified Data Model & API. Cluster Manager & Scheduler; Snappy Data Server (Spark Executor + Store); Parser; OLAP; TRX; Data Synopsis Engine; Distributed Membership Service; HA; Stream Processing; Tables; ODBC/JDBC; Data Frame; RDD; Low Latency; High Latency; HYBRID Store (Probabilistic, Rows, Columns, Index); Query Optimizer; Add / Remove Server • Mutability (DML + Trx) • Indexing • SQL-based streaming
  • 20. Hybrid Store. Unbounded streams: ingestion, real-time sampling, transactional state update. Store: Probabilistic, Index, Rows, Row Buffer, Columns. Random writes (reference data); OLAP; stream analytics. Row table, column table, sample table
  • 21. Simple API and Spark Compatible
      // Use the Spark Data Source API
      val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
      snappy.createTable("TableName", "Row | Column | Sample", schema, options)
      someDataFrame.write.insertInto("TableName")
      // Update, Delete, Put using the SnappySession
      snappy.update("tableName", filterExpr, Row(<newColumnValues>), updatedColumns)
      snappy.delete("tableName", filterExpr)
      // Or, just use Spark SQL syntax
      snappy.sql("select x from tableName").count
      // Or, use JDBC/ODBC to access it like a regular database
      jdbcStatement.executeUpdate("insert into tableName values …")
  • 22. Extends Spark SQL
      CREATE [Temporary] TABLE [IF NOT EXISTS] table_name (
        <column definition>
      ) USING 'ROW | COLUMN'
      OPTIONS (
        COLOCATE_WITH 'table_name',
        PARTITION_BY 'PRIMARY KEY | column name',                -- Replicated table, by default
        REDUNDANCY '1',                                          -- Manage HA
        PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",  -- Default: only in-memory
        OFFHEAP "true | false",
        EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
        …..
      ) [AS select_statement];
  • 23. Stream SQL DDL and Continuous Queries (based on Spark Streaming)
      Pipeline stages: consume from stream; transform raw data; continuous analytics; ingest into in-memory store; overflow table to HDFS
      create stream table AdImpressionLog (<Columns>) using directkafka_stream options (
        <socket endpoints>
        "topics 'adnetwork-topic'",
        "rowConverter 'AdImpressionLogAvroDecoder'")
      streamingContext.registerCQ(
        "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques from AdImpressionLog window (duration '2' seconds, slide '2' seconds) where geo != 'unknown' group by publisher, geo") // Register CQ
        .foreachDataFrame(df => {
          df.write.format("column").mode(SaveMode.Append)
            .saveAsTable("adImpressions")
        })
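The continuous query on slide 23 groups ad impressions into 2-second windows and aggregates per publisher and geo. As a rough illustration of what that query computes, here is a minimal sketch in plain Scala with no Spark or SnappyData dependency. The `AdImpression` fields, `WindowStats`, and `WindowedAggregation` are hypothetical names chosen for this example, and the windows are tumbling because the slide uses the same value for duration and slide.

```scala
// Hypothetical event type; field names mirror the columns used in the slide's query.
final case class AdImpression(timestampMs: Long, publisher: String, geo: String, bid: Double, cookie: String)

// Per-window result: avg(bid), count(*), count(distinct cookie).
final case class WindowStats(avgBid: Double, imps: Int, uniques: Int)

object WindowedAggregation {
  // Assign each event to the tumbling window that contains its timestamp,
  // drop rows with geo = "unknown", and aggregate per (windowStart, publisher, geo).
  def aggregate(events: Seq[AdImpression], windowMs: Long): Map[(Long, String, String), WindowStats] =
    events
      .filter(_.geo != "unknown")
      .groupBy(e => (e.timestampMs / windowMs * windowMs, e.publisher, e.geo))
      .map { case (key, group) =>
        key -> WindowStats(
          avgBid = group.map(_.bid).sum / group.size,
          imps = group.size,
          uniques = group.map(_.cookie).distinct.size
        )
      }
}
```

Calling `WindowedAggregation.aggregate(events, 2000L)` returns one entry per window start, publisher, and geo. A real streaming engine would compute this incrementally as batches arrive rather than over a complete in-memory sequence.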
  • 24. Updates & Deletes on Column Tables. One partition over time: Column Segment (t1-t2), Column Segment (t2-t3); delete bitmask (0 1 0 0 0 0 1 1 0); key column K11 K12 …; columns C11 C12 …, C21 C22 …; Summary Metadata; Periodic Compaction. WRITE: Row Buffer, MVCC, New Segment, Replicate for HA
  • 25. Can we use statistical techniques to shrink data? • Most apps are happy to trade off 1% accuracy for a 200x speedup! • Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data! • Often can make perfectly accurate decisions with imperfect answers! • A/B testing, visualization, ... • The data itself is usually noisy • Processing the entire data doesn't necessarily mean exact answers!
  • 26. Probabilistic Store: Sketches + Uniform & Stratified Samples. 1. Streaming CMS (Count-Min Sketch): higher resolution for more recent time ranges; ranges [t1, t2), [t2, t3), [t3, t4), [t4, now); time: 4T, 2T, T, ≤T. 2. Top-K queries with arbitrary filters: maintain a small sample at each CMS cell (traditional CMS vs. CMS + samples). 3. Fully distributed stratified samples: always include timestamp as a stratified column for streams. Streams; aging; row store (in-memory); column store (disk); timestamp
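For readers unfamiliar with the Count-Min Sketch named on slide 26, the following is a minimal sketch of the basic data structure in plain Scala. It does not reproduce the time-ranged or sample-augmented variants the slide describes; the class name and parameters are illustrative assumptions, not SnappyData's API.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: `depth` hash rows, each with `width` counters.
final class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")
  private val counts = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  // Increment one counter in every row.
  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) counts(row)(bucket(item, row)) += count

  // Collisions can only inflate counters, so the minimum over rows
  // is an upper-biased estimate that never undercounts.
  def estimate(item: String): Long =
    (0 until depth).map(row => counts(row)(bucket(item, row))).min
}
```

Memory use is fixed at `depth * width` counters regardless of how many distinct items arrive, which is why sketches suit unbounded streams; the trade-off is that estimates can overcount when items collide.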
  • 27. Overview 27 Cluster Manager & Scheduler Snappy Data Server (SparkExecutor+ Store) Parser OLAP Transact Data Synopsis Engine Distributed Membership Service H A Stream Processing Data Frame RDD Low Latency High Latency HYBRID Store Probabilistic Rows Columns Index Query Optimizer Add / Remove Server Tables ODBC/JDBC
  • 28. Supporting Real-time & HA. Locator; Lead Node; Executor JVM (Server); Shared Nothing Persistence; JDBC/ODBC; Catalog Service; Managed Driver; SPARK Contacts; SPARK Context; SNAPPY Cluster Manager; REST; SPARK JOBS; SPARK Program; Memory Mgmt; BLOCKS; SNAPPY STORE; Stream; SNAPPY Tables; Tables; DataFrame; Peer-2-peer • Spark Executors are long-running. Driver failure doesn't shut down Executors • Driver HA – Drivers are "managed" by SnappyData with a standby secondary • Data HA – Consensus-based clustering integrated for eager replication
  • 29. Overview 29 Cluster Manager & Scheduler Snappy Data Server (SparkExecutor+ Store) Parser OLAP Transact Data Synopsis Engine Distributed Membership Service H A Stream Processing Data Frame RDD Low Latency High Latency HYBRID Store Probabilistic Rows Columns Index Query Optimizer Add / Remove Server Tables ODBC/JDBC
  • 30. Transactions • Support for Read Committed & Repeatable Read • W-W and R-W conflict detection at write time • MVCC for non-blocking reads and snapshot isolation • Distributed system failure detection integrated with commit protocol - Evict unresponsive replicas - Ensure consistency when replicas recover
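To make "conflict detection at write time" on slide 30 concrete, here is a minimal single-threaded sketch of write-write (W-W) detection in plain Scala. It is an illustrative model only: `WriteConflictDetector` and its methods are hypothetical names, and it leaves out R-W detection, MVCC, replication, and the commit protocol.

```scala
import scala.collection.mutable

// Tracks which transaction currently holds the write on each key.
final class WriteConflictDetector {
  private val owners = mutable.Map.empty[String, Long]

  // Returns true if txId now owns the key. Returns false when another
  // in-flight transaction already wrote it, i.e. the conflict is reported
  // at write time instead of at commit time.
  def tryWrite(txId: Long, key: String): Boolean = owners.get(key) match {
    case Some(owner) if owner != txId => false
    case _ =>
      owners.update(key, txId)
      true
  }

  // Release every key held by txId (on commit or rollback).
  def finish(txId: Long): Unit = {
    val held = owners.filter(_._2 == txId).keys.toList
    held.foreach(k => owners.remove(k))
  }
}
```

Failing fast at write time means the losing transaction does no further work before it learns it must retry, which suits short, low-latency transactions.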
  • 31. Overview 31 Cluster Manager & Scheduler Snappy Data Server (SparkExecutor+ Store) Parser OLAP Transact Data Synopsis Engine Distributed Membership Service H A Stream Processing Data Frame RDD Low Latency High Latency HYBRID Store Probabilistic Rows Columns Index Query Optimizer Add / Remove Server Tables ODBC/JDBC
  • 32. Query Optimization • Bypass the scheduler for transactions and low-latency jobs • Minimize shuffles aggressively - Dynamic replication for reference data - Retain ‘join indexes’ whenever possible - Collocate related data sets • Optimized ‘Hash Join’, ‘Scan’, ‘GroupBy’ compared to Spark - Uses more variables in code generation, vectorized structures • Column segment pruning through statistics
  • 33. Co-partitioning & Co-location. Spark Executor 1: Subscriber A-M, ref data, fed by KAFKA queue for Subscriber A-M. Spark Executor 2: Subscriber N-Z, ref data, fed by KAFKA queue for Subscriber N-Z. Linearly scale with partition pruning
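The idea on slide 33 is that the subscriber table and the incoming stream are split by the same key, so each executor can join its events against rows it already holds. A minimal sketch in plain Scala follows, assuming hash partitioning on a subscriber id (the slide shows range partitions A-M and N-Z; the principle is the same). All names here are hypothetical.

```scala
object CoPartitioning {
  final case class Subscriber(id: String, optedIn: Boolean)
  final case class Event(subscriberId: String, cell: String)

  // The same function partitions both the table and the stream (assumption: hash on subscriber id).
  def partitionOf(subscriberId: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(subscriberId.hashCode, numPartitions)

  // Join each event with the subscriber row stored in the same partition.
  // No lookup ever crosses a partition boundary, so no shuffle is needed.
  def localJoin(subscribers: Seq[Subscriber], events: Seq[Event], numPartitions: Int): Seq[(Event, Subscriber)] = {
    val tableParts: Map[Int, Map[String, Subscriber]] =
      subscribers
        .groupBy(s => partitionOf(s.id, numPartitions))
        .map { case (p, rows) => p -> rows.map(r => r.id -> r).toMap }
    val eventParts: Map[Int, Seq[Event]] =
      events.groupBy(e => partitionOf(e.subscriberId, numPartitions))

    eventParts.toSeq.sortBy(_._1).flatMap { case (p, evs) =>
      val local = tableParts.getOrElse(p, Map.empty[String, Subscriber])
      evs.flatMap(e => local.get(e.subscriberId).map(s => (e, s)))
    }
  }
}
```

Because every partition works only on its own slice, adding partitions (and executors) spreads both storage and join work, which is the linear scaling the slide refers to.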
  • 34. Overview 34 Cluster Manager & Scheduler Snappy Data Server (SparkExecutor+ Store) Parser OLAP Transact Data Synopsis Engine Distributed Membership Service H A Stream Processing Data Frame RDD Low Latency High Latency HYBRID Store Probabilistic Rows Columns Index Query Optimizer Add / Remove Server Tables ODBC/JDBC
  • 35. Approximate Query Processing (AQP): Academia vs. Industry
      Academia: 25+ yrs of successful research (AQUA, Online Aggregation, MapReduce Online, STRAT, ABS, BlinkDB / G-OLA, ...)
      BUT: user-facing AQP almost non-existent in the commercial world! Some approximate features in Infobright, Yahoo's Druid, Facebook's Presto, Oracle 12C, ...
      select geo, avg(bid) from adImpressions group by geo having avg(bid) > 10 with error 0.05 at confidence 95
      geo | avg(bid) | error | prob_existence
      MI  | 21.5     | ± 0.4 | 0.99
      CA  | 18.3     | ± 5.1 | 0.80
      MA  | 15.6     | ± 2.4 | 0.81
      ... | ...      | ...   | ...
      WHY? 1. Incompatible w/ BI tools 2. Complex semantics 3. Bad sales pitch!
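The "± error" column in the slide 35 result can be understood as a confidence bound computed from a sample instead of the full table. A minimal sketch in plain Scala is shown below, assuming a uniform random sample and a normal approximation (z = 1.96 for roughly 95% confidence). This is a textbook estimator used for illustration, not a description of how SnappyData computes its error bounds; `ApproxAvg` and `Estimate` are hypothetical names.

```scala
object ApproxAvg {
  final case class Estimate(mean: Double, errorBound: Double)

  // Sample mean plus a normal-approximation bound: z * sqrt(s^2 / n),
  // where s^2 is the unbiased sample variance.
  def avgWithError(sample: Seq[Double], z: Double = 1.96): Estimate = {
    require(sample.size >= 2, "need at least two sampled values")
    val n = sample.size
    val mean = sample.sum / n
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    Estimate(mean, z * math.sqrt(variance / n))
  }
}
```

The bound shrinks with the square root of the sample size, so quadrupling the sample roughly halves the error; that relationship is what lets a small sample answer many aggregate queries within a stated tolerance.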
  • 36. A First Industrial-Grade AQP Engine: iSight (Immediate inSight) 1. High-level Accuracy Contract (HAC) • Concurrency: 10's of queries in shared clusters • Resource usage: everyone hates their AWS bill • Network shuffles • Immediate results while waiting for final results 2. Fully compatible w/ BI tools • Set HAC behavior at JDBC/ODBC connection level 3. Better marketing! • User picks a single number p, where 0≤p≤1 (by default p=0.95) • Snappy guarantees that s/he only sees things that are at least p accurate (95% by default) • Snappy handles (and hides) everything else! Result shown to the user: geo, avg(bid): MI 21.5; WI 42.3; NY 65.6; ...
  • 38. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 39. Lessons Learned
      1. A unique experience marrying two different breeds of distributed systems: lineage-based for high throughput vs. (consensus-)replication-based for low latency
      2. A unified cluster is simpler, cheaper, and faster - By sharing state across apps, we decouple apps from data servers and provide HA - Save memory, data copying, serialization, and shuffles - Co-partitioning and co-location for faster joins and stream analytics
      3. Advantages over HTAP engines: deep stream integration + AQP - Stream processing ≠ stream analytics - Top-k w/ almost arbitrary predicates + 1-pass stratified sampling over streams
      4. Commercializing academic work is lots of work but also lots of fun
  • 40. THANK YOU ! Try our iSight cloud for free: http://snappydata.io/iSight
  • 41. iSight: Immediate inSight iSight’s immediate answer to the query: 1.7 secs Final answer to the query: 42.7 secs 25x speedup!
  • 42. Our Solution: High-level Accuracy Contract (HAC) • A single number 0≤p≤1 (by default p=0.95) • We guarantee that you only see things that are at least p accurate (95% by default) • We handle (and hide) everything else – Choose a behavior: REPLACE WITH SPECIAL SYMBOL (default), DO NOTHING, or DROP THE ROW
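The three HAC behaviors named on slide 42 can be sketched as a post-processing step over approximate results. The sketch below is plain Scala and rests on one loud assumption: a row "meets the contract" when `1 - relativeError >= p`. The slides do not define the accuracy test, so that rule, along with `Row`, `Behavior`, and `applyContract`, is hypothetical. `None` stands in for the "special symbol".

```scala
object AccuracyContract {
  sealed trait Behavior
  case object ReplaceWithSpecialSymbol extends Behavior
  case object DoNothing extends Behavior
  case object DropTheRow extends Behavior

  // relativeError is an assumed per-row error estimate, e.g. 0.02 for 2%.
  final case class Row(geo: String, avgBid: Double, relativeError: Double)

  // Assumption (not from the slides): a row meets the contract when 1 - relativeError >= p.
  def applyContract(rows: Seq[Row], p: Double, behavior: Behavior): Seq[(String, Option[Double])] = {
    def meets(r: Row): Boolean = 1.0 - r.relativeError >= p
    behavior match {
      case DoNothing =>
        rows.map(r => (r.geo, Option(r.avgBid)))
      case DropTheRow =>
        rows.filter(meets).map(r => (r.geo, Option(r.avgBid)))
      case ReplaceWithSpecialSymbol =>
        rows.map(r => (r.geo, if (meets(r)) Some(r.avgBid) else None))
    }
  }
}
```

With any of the three behaviors the output keeps the plain `geo, avg(bid)` shape, which is consistent with the slides' point that BI tools need no changes to consume approximate answers.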

Editor's Notes

  1. Many subscribers, lots of 2G/3G/4G voice/data Network events: location events, CDRs, network issues
  2. You have a stream that represents the current window, a large table that is perpetually growing (maybe the app is reducing data through continuous aggregations, i.e. a priori known summaries), and many smaller reference data tables. The reference data is sourced from other systems, such as a CRM, and is likely to be changing in real time as well.
  3. Challenges: - Stream analytics: joins between streams, history, and reference data for correlations, trends, and pattern matching. - Very fast writes, including updates. Potentially transactional semantics. Of course, you need exactly-once guarantees. - Interactive queries: these tend to be ad-hoc in nature. They could be point lookups or aggregation/scan-oriented queries. Only some can be satisfied using pre-fabricated aggregation tables. OLAP cube maintenance in a streaming world is very difficult. - Many users could be interacting, so concurrency is important.
  5. When mentioning sketches, say "infinite streams".
  6. Many purpose-built products. To derive meaningful insight, apps have to stitch products together.
  7. According to the latest report, 75% of enterprises believe real time ranges from important to critical for business success. Why they need real time: 52% said for delivering real-time dashboards and improving customer experience. Defining real time: 81% of enterprises define real time as either second or sub-second speeds.
  8. Snappy store and Spark Executor share the same process space and JVM memory. Reference-based access – zero copy.
  10. Contributions: sharing state across many clients and applications to minimize (de-)serialization (§5.4); providing high availability through low-latency failure detection and decoupling applications from data servers (§6.1); bypassing the scheduler to interleave fine-grained operations with long-running jobs (§6.3); ensuring transactional consistency (§6.4); sharing the same process space and memory pool between Spark and Geode (§5.3); co-partitioning tables and streams to minimize data shuffling (§5.5).
 For the unified API, say: Spark's DataFrame API allows for ML, graph, batch & streaming, and SQL (selects). SnappyData adds full SQL support and extends the DataFrame and DataSource APIs for mutability semantics (DML & transactions), indexing, and SQL-based streaming.
  11. No changes to the BI tool; no need to be a statistician; and there's still an API to ask for all sorts of detailed error information if you're an advanced user.