SlideShare a Scribd company logo
1 of 34
Download to read offline
SnappyData
Getting Spark ready for real-time,
operational analytics
www.snappydata.io
Suds Menon
Co-Founder SnappyData
Jan 2016
Last Week Tonight in Big Data
www.snappydata.io
IoT is what makes the big data challenge very real
A 10 Trillion Device World1
www.snappydata.io
1:http://cacm.acm.org/news/191847-get-ready-to-live-in-a-trillion-device-world/fulltext
Because Insights are like people. Useful for a short period of time
The New Arms Race
www.snappydata.io
●  Sift through data to get insights
to improve your business
●  What is your time to insights?
●  What is your time to
operationalizing insights?
Can we use the past to accurately predict the future?
The Holy Grail of Analytics
www.snappydata.io
The faster you go, the bigger your business advantage
Speeding Up Insights
www.snappydata.io
Exploding data volumes fuel the search for distributed solutions
How We Got Here
www.snappydata.io
Teradata
Cognos
GreenPlum
Netezza,
ParAccel
Hadoop
(SQL on
Hadoop)
Spark
(Spark
SQL)
Every enterprise today deals with these 4 kinds of data interactions
The Four Horsemen Of Data
www.snappydata.io
OLTP OLAP Streaming Machine
Learning
Who Are We?
●  An EMC-Pivotal spinout focused on real time operational
analytics
●  New Spark-based open source project started by Pivotal
GemFire founders+engineers
●  Decades of in-memory data management experience
●  Focus on real-time, operational analytics: Spark inside an
OLTP+OLAP database
www.snappydata.io
SnappyData At Cruising Altitude
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real time operational Analytics – TBs in memory
RDB
Rows
Txn
Columnar
API
Stream processing
ODBC,
JDBC, REST
Spark -
Scala, Java,
Python, R
HDFS
AQP
First commercial project on Approximate
Query Processing(AQP)
MPP DB
Index
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream
for real-time analytics
Batch design, high throughput
Real-­‐time	
  
design	
  center	
  
-­‐	
  Low	
  latency,	
  HA,	
  
concurrent	
  
Vision: Drastically reduce the cost and
complexity in modern big data
Huge community adoption, slip streaming into Hadoop momentum, great data integration platform
Why Spark?
•  Most events in life can be analyzed as micro batches
•  Blends streaming, interactive, and batch analytics
•  Appeals to Java, R, Python, Scala programmers
•  Rich set of transformations and libraries
•  RDD and fault tolerance without replication
•  Offers Spark SQL as a key capability
www.snappydata.io
Spark is a compute framework that processes data, not an analytics database
Clearing Up Some Spark Myths
www.snappydata.io
●  It is NOT a distributed in-memory database
○  It’s a computational framework with immutable caching
●  It is NOT Highly Available
○  Fault tolerance is not the same as HA
●  NOT well suited for real time, operational environments
○  Does not handle concurrency well
○  Does not share data very well either
SnappyData & Lambda
SnappyData Focus
Perspective on Lambda for real time
In-Memory DB
Interactive queries,
updates
Deep Scale, High
volume
MPP DB
Transform
Data-in-motion
Analytics
Application
Streams
Alerts
RELEVANT USECASES
www.snappydata.io
Market Surveillance
www.snappydata.io
FLAG
DETECT
ANALYZE
INGEST
Identify patterns
based on query
results
Partitioned, HA
stream ingestion
Prevent
settlement,
investigate further
SQL queries &
Stream Analytics
on microbatches
Contextual Marketing
www.snappydata.io
RESPOND
DECIDE
ANALYZE
INGEST
Pick Ad based on
variety of reference
data parameters
Transactional
request for Ad
placement
Deliver in real
time
Join with history, join
with user profile, join
with location
Location Based Telco Services
www.snappydata.io
Geo Fencing Mobile Marketing Network Analytics
●  INGEST, CORRELATE, JOIN WITH HISTORICAL DATA,
RESPOND
Spark Architecture
Driver
Cluster
Manager
(YARN,
Mesos,
Standalone)
Worker
Worker
Worker
Executor
REST API for
Job
Submission
Worker
Worker
Worker
Data
Server
Execut
or
Cluster
Manager
(YARN,
Mesos,
Standalone)
Data
Server
Execut
or
Snappy Infused Spark Architecture
JDBC Clients
ODBC Clients
Job ServerLead Node
Lead Node
Core Components Of SnappyData
Colocated row/column Tables in Spark
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
●  Spark Executors are long lived and shared across multiple apps
●  Gem Memory Mgr and Spark Block Mgr integrated
Table can be partitioned or replicated
Replicated
Table
Partitioned
Table
(Buckets A-H) Replicated
Table
Partitioned
Table
(Buckets I-P)
consistent replica on each node
Partition
Replica
(Buckets A-H)
Replicated
Table
Partitioned
Table
(Buckets Q-W)Partition
Replica
(Buckets I-P)
Data partitioned with one or more replicas
Linearly scale with shared partitions
Spark Executor
Spark Executor
Kafka
queue
Subscriber N-Z
Subscriber A-M
Subscriber A-M
Ref data
Linearly scale with partition pruning
Input queue,
Stream, IMDB,
Output queue
all share the
same
partitioning
strategy
Point access, updates, fast writes
●  Row tables with PKs are distributed HashMaps
○  with secondary indexes
●  Support for transactional semantics
○  read_committed, repeatable_read
●  Support for scalable high write rates
○  streaming data goes through stages
○  queue streams, intermediate storage (Delta row buffer),
immutable compressed columns
Full Spark Compatibility
●  Any table is also visible as a DataFrame
●  Any RDD[T]/DataFrame can be stored in SnappyData
tables
●  Tables appear like any JDBC sourced table
○  But, in executor memory by default
●  Addtional API for updates, inserts, deletes
//Save a dataFrame using the spark context …
context.createExternalTable(”T1", "ROW", myDataFrame.schema, props );
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable(”T1");
Extends Spark
CREATE	
  [Temporary]	
  TABLE	
  [IF	
  NOT	
  EXISTS]	
  table_name	
  
	
  	
  	
  (	
  
	
  	
  	
  	
  	
  	
  	
  <column	
  deIinition>	
  	
  
	
  	
  	
  )	
  	
  USING	
  ‘JDBC	
  |	
  ROW	
  |	
  COLUMN	
  ’	
  
OPTIONS	
  (	
  
	
  	
  	
  COLOCATE_WITH	
  'table_name',	
  	
  	
   	
  //	
  Default	
  none	
  
	
  	
  	
  PARTITION_BY	
  'PRIMARY	
  KEY	
  |	
  column	
  name',	
  //	
  will	
  be	
  a	
  replicated	
  table,	
  by	
  default	
  
	
  	
  	
  REDUNDANCY	
  	
  	
  	
  	
  	
  	
  	
  '1'	
  ,	
  	
  	
   	
   	
  //	
  Manage	
  HA	
  
	
  	
  PERSISTENT	
  	
  	
  "DISKSTORE_NAME	
  ASYNCHRONOUS	
  |	
  	
  SYNCHRONOUS",	
  	
  
	
   	
   	
  //	
  Empty	
  string	
  will	
  map	
  to	
  default	
  disk	
  store.	
  	
  
	
  	
  OFFHEAP	
  "true	
  |	
  false"	
  	
  
	
  	
  EVICTION_BY	
  	
  "MEMSIZE	
  200	
  |	
  COUNT	
  200	
  |	
  HEAPPERCENT",	
  
…..	
  	
  
	
  [AS	
  select_statement];	
  
Key feature: Synopses Data
●  Maintain stratified samples
○  Intelligent sampling to keep error bounds low
●  Probabilistic data
○  TopK for time series (using time aggregation CMS, item
aggregation)
○  Histograms, HyperLogLog, Bloom Filters, Wavelets
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
BASETABLE ‘table_name’ // source column table or stream table
[ SAMPLINGMETHOD "stratified | uniform" ]
STRATA name (
QCS (“comma-separated-column-names”)
[ FRACTION “frac” ]
),+ // one or more QCS
www.snappydata.io
AQP Architecture
www.snappydata.io
Spot The Differences
www.snappydata.io
SnappyData is Open Source
●  Beta will be on github in January. We are looking for
contributors!
●  Learn more & register for beta: www.snappydata.io
●  Connect:
○  twitter: www.twitter.com/snappydata
○  facebook: www.facebook.com/snappydata
○  linkedin: www.linkedin.com/snappydata
○  slack: http://snappydata-slackin.herokuapp.com
○  IRC: irc.freenode.net #snappydata
Q&A
www.snappydata.io
THANK YOU
www.snappydata.io

More Related Content

What's hot

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 

What's hot (20)

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 

Viewers also liked

Apache Geode で始める Spring Data Gemfire
Apache Geode で始めるSpring Data GemfireApache Geode で始めるSpring Data Gemfire
Apache Geode で始める Spring Data GemfireAkihiro Kitada
 
(LT)Spark and Cassandra
(LT)Spark and Cassandra(LT)Spark and Cassandra
(LT)Spark and Cassandradatastaxjp
 
超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODEMasaki Yamakawa
 
Awsでつくるapache kafkaといろんな悩み
Awsでつくるapache kafkaといろんな悩みAwsでつくるapache kafkaといろんな悩み
Awsでつくるapache kafkaといろんな悩みKeigo Suda
 
インメモリーデータグリッドの選択肢
インメモリーデータグリッドの選択肢インメモリーデータグリッドの選択肢
インメモリーデータグリッドの選択肢Masaki Yamakawa
 
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRecruit Technologies
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方Yoshiyasu SAEKI
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
 

Viewers also liked (9)

Apache Geode で始める Spring Data Gemfire
Apache Geode で始めるSpring Data GemfireApache Geode で始めるSpring Data Gemfire
Apache Geode で始める Spring Data Gemfire
 
(LT)Spark and Cassandra
(LT)Spark and Cassandra(LT)Spark and Cassandra
(LT)Spark and Cassandra
 
超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE
 
噛み砕いてKafka Streams #kafkajp
噛み砕いてKafka Streams #kafkajp噛み砕いてKafka Streams #kafkajp
噛み砕いてKafka Streams #kafkajp
 
Awsでつくるapache kafkaといろんな悩み
Awsでつくるapache kafkaといろんな悩みAwsでつくるapache kafkaといろんな悩み
Awsでつくるapache kafkaといろんな悩み
 
インメモリーデータグリッドの選択肢
インメモリーデータグリッドの選択肢インメモリーデータグリッドの選択肢
インメモリーデータグリッドの選択肢
 
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 

Similar to SnappyData Overview Slidedeck for Big Data Bellevue

Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 

Similar to SnappyData Overview Slidedeck for Big Data Bellevue (20)

Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 

Recently uploaded

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 

Recently uploaded (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 

SnappyData Overview Slidedeck for Big Data Bellevue

  • 1. SnappyData Getting Spark ready for real-time, operational analytics www.snappydata.io Suds Menon Co-Founder SnappyData Jan 2016
  • 2. Last Week Tonight in Big Data www.snappydata.io
  • 3. IoT is what makes the big data challenge very real A 10 Trillion Device World1 www.snappydata.io 1:http://cacm.acm.org/news/191847-get-ready-to-live-in-a-trillion-device-world/fulltext
  • 4. Because Insights are like people. Useful for a short period of time The New Arms Race www.snappydata.io ●  Sift through data to get insights to improve your business ●  What is your time to insights? ●  What is your time to operationalizing insights?
  • 5. Can we use the past to accurately predict the future? The Holy Grail of Analytics www.snappydata.io
  • 6. The faster you go, the bigger your business advantage Speeding Up Insights www.snappydata.io
  • 7. Exploding data volumes fuel the search for distributed solutions How We Got Here www.snappydata.io Teradata Cognos GreenPlum Netezza, ParAccel Hadoop (SQL on Hadoop) Spark (Spark SQL)
  • 8. Every enterprise today deals with these 4 kinds of data interactions The Four Horsemen Of Data www.snappydata.io OLTP OLAP Streaming Machine Learning
  • 9. Who Are We? ●  An EMC-Pivotal spinout focused on real time operational analytics ●  New Spark-based open source project started by Pivotal GemFire founders+engineers ●  Decades of in-memory data management experience ●  Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database www.snappydata.io
  • 10. SnappyData At Cruising Altitude Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real time operational Analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial project on Approximate Query Processing(AQP) MPP DB Index
  • 11. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-­‐time   design  center   -­‐  Low  latency,  HA,   concurrent   Vision: Drastically reduce the cost and complexity in modern big data
  • 12. Huge community adoption, slip streaming into Hadoop momentum, great data integration platform Why Spark? •  Most events in life can be analyzed as micro batches •  Blends streaming, interactive, and batch analytics •  Appeals to Java, R, Python, Scala programmers •  Rich set of transformations and libraries •  RDD and fault tolerance without replication •  Offers Spark SQL as a key capability www.snappydata.io
  • 13. Spark is a compute framework that processes data, not an analytics database Clearing Up Some Spark Myths www.snappydata.io ●  It is NOT a distributed in-memory database ○  It’s a computational framework with immutable caching ●  It is NOT Highly Available ○  Fault tolerance is not the same as HA ●  NOT well suited for real time, operational environments ○  Does not handle concurrency well ○  Does not share data very well either
  • 15. Perspective on Lambda for real time In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts
  • 17. Market Surveillance www.snappydata.io FLAG DETECT ANALYZE INGEST Identify patterns based on query results Partitioned, HA stream ingestion Prevent settlement, investigate further SQL queries & Stream Analytics on microbatches
  • 18. Contextual Marketing www.snappydata.io RESPOND DECIDE ANALYZE INGEST Pick Ad based on variety of reference data parameters Transactional request for Ad placement Deliver in real time Join with history, join with user profile, join with location
  • 19. Location Based Telco Services www.snappydata.io Geo Fencing Mobile Marketing Network Analytics ●  INGEST, CORRELATE, JOIN WITH HISTORICAL DATA, RESPOND
  • 22. Core Components Of SnappyData
  • 23. Colocated row/column Tables in Spark Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing ●  Spark Executors are long lived and shared across multiple apps ●  Gem Memory Mgr and Spark Block Mgr integrated
  • 24. Table can be partitioned or replicated Replicated Table Partitioned Table (Buckets A-H) Replicated Table Partitioned Table (Buckets I-P) consistent replica on each node Partition Replica (Buckets A-H) Replicated Table Partitioned Table (Buckets Q-W)Partition Replica (Buckets I-P) Data partitioned with one or more replicas
  • 25. Linearly scale with shared partitions Spark Executor Spark Executor Kafka queue Subscriber N-Z Subscriber A-M Subscriber A-M Ref data Linearly scale with partition pruning Input queue, Stream, IMDB, Output queue all share the same partitioning strategy
  • 26. Point access, updates, fast writes ●  Row tables with PKs are distributed HashMaps ○  with secondary indexes ●  Support for transactional semantics ○  read_committed, repeatable_read ●  Support for scalable high write rates ○  streaming data goes through stages ○  queue streams, intermediate storage (Delta row buffer), immutable compressed columns
  • 27. Full Spark Compatibility ●  Any table is also visible as a DataFrame ●  Any RDD[T]/DataFrame can be stored in SnappyData tables ●  Tables appear like any JDBC sourced table ○  But, in executor memory by default ●  Addtional API for updates, inserts, deletes //Save a dataFrame using the spark context … context.createExternalTable(”T1", "ROW", myDataFrame.schema, props ); //save using DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable(”T1");
  • 28. Extends Spark CREATE  [Temporary]  TABLE  [IF  NOT  EXISTS]  table_name        (                <column  deIinition>          )    USING  ‘JDBC  |  ROW  |  COLUMN  ’   OPTIONS  (        COLOCATE_WITH  'table_name',        //  Default  none        PARTITION_BY  'PRIMARY  KEY  |  column  name',  //  will  be  a  replicated  table,  by  default        REDUNDANCY                '1'  ,          //  Manage  HA      PERSISTENT      "DISKSTORE_NAME  ASYNCHRONOUS  |    SYNCHRONOUS",          //  Empty  string  will  map  to  default  disk  store.        OFFHEAP  "true  |  false"        EVICTION_BY    "MEMSIZE  200  |  COUNT  200  |  HEAPPERCENT",   …..      [AS  select_statement];  
  • 29. Key feature: Synopses Data ●  Maintain stratified samples ○  Intelligent sampling to keep error bounds low ●  Probabilistic data ○  TopK for time series (using time aggregation CMS, item aggregation) ○  Histograms, HyperLogLog, Bloom Filters, Wavelets CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS ( BASETABLE ‘table_name’ // source column table or stream table [ SAMPLINGMETHOD "stratified | uniform" ] STRATA name ( QCS (“comma-separated-column-names”) [ FRACTION “frac” ] ),+ // one or more QCS
  • 32. www.snappydata.io SnappyData is Open Source ●  Beta will be on github in January. We are looking for contributors! ●  Learn more & register for beta: www.snappydata.io ●  Connect: ○  twitter: www.twitter.com/snappydata ○  facebook: www.facebook.com/snappydata ○  linkedin: www.linkedin.com/snappydata ○  slack: http://snappydata-slackin.herokuapp.com ○  IRC: irc.freenode.net #snappydata