SlideShare a Scribd company logo
1 of 29
Low-Latency Analytics with NoSQL
Joe Caserta
June 18, 2014
Storm / Cassandra
Quick Intro - Joe Caserta
Launched Big Data practice
Co-author, with Ralph Kimball, The
Data Warehouse ETL Toolkit (Wiley)
Dedicated to Data Warehousing,
Business Intelligence since 1996
Began consulting database
programing and data modeling
25+ years hands-on experience
building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published
in Intelligent Enterprise
Formalized Alliances / Partnerships –
System Integrators
Partnered with Big Data vendors
Cloudera, HortonWorks, Datameer,
more…
Launched Training practice, teaching
data concepts world-wide
Laser focus on extending Data
Warehouses with Big Data solutions
1986
2004
1996
2009
2001
2010
2013
Launched Big Data Warehousing
Meetup in NYC – 950+ Members
2012
Established best practices for big
data ecosystem implementation
Listed as a Top 20 Data Analytics
Consulting Companies - CIO Review
Expertise & Offerings
Strategic Roadmap /
Assessment / Education /
Implementation
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Big Data
Analytics
Client Portfolio
Finance. Healthcare
& Insurance
Retail/eCommerce
& Manufacturing
Education
& Services
Listed as one of the 20 Most Promising
Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted
the ones who are at the forefront of tackling the real analytics challenges.
A distinguished panel comprising of CEOs, CIOs, VCs, industry analysts and the editorial
board of CIOReview selected the Final 20.
Caserta Concepts
Sales
Marketing
Finance
ETL
Ad-Hoc Query
Horizontally Scalable Environment - Optimized for Analytics
Big Data Cluster
Canned Reporting
Big Data Analytics
NoSQL
Databases
ETL
Ad-Hoc/Canned
Reporting
Traditional BI
Mahout MapReduce Pig/Hive
N1 N2 N4N3 N5
Hadoop Distributed File System (HDFS)
Others…
Why is Data Analytics so important?
Data Science
Enterprise
Data Warehouse
• Data is coming in so
fast, how do we monitor
it?
• Real real-time analytics
• Relevance engines,
financial fraud sensors,
early warning sensors
• Dealing with sparse,
incomplete, volatile,
and highly
manufactured data
• Agile to adapt quickly
to changing business
• Wider breadth of
datasets and sources in
scope requires larger
data repositories
• Most of world’s data is
unstructured,
semi-structured or
multi-structured
• Data volume is
growing so
processes must be
more reliant on
programmatic
administration
• Less people/process
dependence
Volume Variety
VelocityVeracity
Challenges With Big Data
What’s Important Today (according to Joe)
Hadoop Distribution: Cloudera, MapR, Hortonworks, Pivotal-HD
 Tools:
 Mahout: Machine learning
 Hive: Map data to structures and use SQL-like queries
 Pig: Data transformation language for big data, from Yahoo
 Storm: Real-time ETL
NoSQL:
Document: MongoDB, CouchDB
Graph: Neo4j, Titan
Key Value: Riak, Redis
Columnar: Cassandra, Hbase
 Languages: SQL, Python, SciPy, Java
 Predictive Modeling: R, SAS, SPSS
Why talk about Storm & Cassandra?
ERP
Finance
Legacy
ETL
Data Analytics
Horizontally Scalable Environment - Optimized for Analytics
Big Data Cluster
Data Science
Big Data BI
NoSQL
Database
ETL
Ad-Hoc/Canned
Reporting
Traditional BI
Mahout MapReduce Pig/Hive
N1 N2 N4N3 N5
Hadoop Distributed File System (HDFS)
Traditional
EDW
Storm
High Volume Ingestion Project Overview
• The equity trading arm of a large US bank needed to
scale its infrastructure to enable the ability to
process/parse trade data real-time and calculate
aggregations/statistics
~ 1Million/second ~12 Billion messages/day ~240 Billon/month
• The solution needed to map the raw data to a data model
in memory or low latency (for real-time), while persisting
mapped data to disk (for end of day).
• The proposed solution also needed to
handle ad-hoc data requests for data
analytics.
The Data
• Primarily FIX messages: Financial Information Exchange 
• Established in early 90's as a standard for trade data
communication  widely used throughout the industry
• Basically a delimited file of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 |
11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 |
44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 |
10=128 |
• A single trade can be comprised of 1000's of such messages,
although typical trades have about a dozen
Additional Requirements
• Linearly scalable
• Highly available  no single point of failure ,quick recovery
• Quicker time to benefit
• Processing guarantees  NO DATA IS LOST!
Some Sample Analytic Use Cases
• Sum(Notional volume) by Ticker: Daily, Hourly, Minute
• Average trade latency (Execution TS – Order TS)
• Wash Sales (sell within x seconds of last buy) for same
Client/Ticker
NoSQL
Database
(Cassandra)
Messaging
Messaging
Aggregation
Framework
Storm
Ingestand
Computation
External Data
Real-time
Dashboards
Push Aggregates
To MQ
Stream
transaction
Detail
Higher
Latency Analytics
And Day End
Application Log
Data
High Volume Real-timeAnalytics - SolutionArchitecture
A little deeper…
Storm Cluster
Sensor
Data
d3.js Analytics
Hadoop Cluster
Low Latency
Analytics
Atomic data
Aggregates
Event Monitors
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data
and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real time analytics are produced from the aggregated
data.
• Higher latency ad-hoc analytics are done in Hadoop
using Pig and Hive
Kafka
What is Storm
• Distributed Event Processor
• Real-time data ingestion and dissemination
• In-Stream ETL
• Reliably process unbounded streams of data
• Storm is fast: Clocked it at over a million tuples per second per node
• It is scalable, fault-tolerant, guarantees your data will be processed
• Preferred technology for real-time big data processing by organizations
worldwide:
• Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By
• Incubator:
• http://wiki.apache.org/incubator/StormProposal
Components of Storm
• Spout – Collects data from upstream feeds and submits
it for processing
• Tuple – A collection of data that is passed within Storm
• Bolt – Processes tuples (Transformations)
• Stream – Identifies outputs from Spouts/Bolts
• Storm usually outputs to a NoSQL database
Why NoSQL?
• Performance:
• Relational databases have a lot of features, overhead that we don’t
need in many cases. Although we will miss some…
• Scalability:
• Most relational databases scale vertically giving them limits to how
large they can get. Federation and Sharding is an awkward manual
process.
• Agile
• Sparse Data / Data with a lot of variation
• Most NoSQL scale horizontally on commodity hardware
• Column families are the equivalent to a table in a RDMS
• Primary unit of storage is a column, they are stored
contiguously
Skinny Rows: Most like relational database. Except
columns are optional and not stored if omitted:
Wide Rows: Rows can be billions of columns wide, used
for time series, relationships, secondary indexes:
What is Cassandra?
Deeper Dive: Cassandra as an Analytic
Database
• Based on a blend of Dynamo and BigTable
• Distributed, master-less
• Super fast writes  Can ingest lots of data!
• Very fast reads
Why did we choose it:
• Data throughput requirements
• High availability
• Simple expansion
• Interesting data models for time series data (more on this
later)
Design Practices
• Cassandra does not support aggregation or joins 
Data model must be tuned to usage
• Denormalize your data (flatten your primary dimensional
attributes into your fact)
• Storing the same data redundantly is OK
Might sound weird but we've been doing this all along
in the traditional world modeling our data to make
analytic queries simple!
Wide rows are our friends
• Cassandra composite columns are powerful for analytic
models
• Facilitate multi-dimensional analysis
• A wide row table may have N number of rows, and a
variable number of columns (millions of columns)
• And now with CQL3 we have “unpacked” wide rows into
named columns  Easy to work with!
20130101 20130102 20130103 20130104 20130104 20130105 …
ClientA 10003 9493 43143 45553 54553 34343 …
ClientB 45453 34313 54543 `23233 4233 34423 …
ClientC 3323 35313 43123 54543 43433 4343 …
… … … … … … .. …
More about wide rows!
• The left-most column is the ROW KEY
• It is the mechanism by which the row is distributed across the Cassandra cluster…
• Care must be taken to prevent hot spots: Dates for example are not generally good
candidates because all load will go to given set of servers on a particular day!
• Data can be filtered using equal and “in” clause
• The top row is the COLUMN KEY
• Their can be a variable number of columns
• It is acceptable to have millions/ even billions of columns in a table
• Columns keys are sorted and can accept a range query (greater than / less than)
20130101 20130102 20130103 20130104 20130104 20130105 …
ClientA 10003 9493 43143 45553 54553 34343 …
ClientB 45453 34313 54543 `23233 4233 34423 …
ClientC 3323 35313 43123 54543 43433 4343 …
… … … … … … .. …
Create table Client_Daily_Summary (
Client text,
Date_ID int,
Trade_Count int,
Primary key (Client, Date_ID))
Traditional CassandraAnalytic Model
If we wanted to track trade count by day, hour we could
stream our ETL to two (or more) summary fact tables
0900 1000 1100 1200 1300 1400
ClientA|20131101 1000 949 4314 4555 5455 3434
ClientA|20131102 4545 3431 5454 2323 423 3442
ClientB|20131101 332 3531 4312 5454 4343 434
20130101 20130102 20130103 20130104 20130104 20130105
ClientA 10003 9493 43143 45553 54553 34343
ClientB 45453 34313 54543 `23233 4233 34423
ClientC 3323 35313 43123 54543 43433 4343
Sample analytic query: Give me daily trade counts for ClientA between Jan 1 and Jan 3:
Select Date_ID, Trade_Count from Client_Hourly_Summary `
where Client='ClientA' and Date_ID>=20130101 and Date_ID <=20130103
Sample analytic query: Give me hourly trade counts for ClientA for Jan1 between 9 and 11 AM
Select Hour, Trade_Count from Client_Hourly_Summary `
where Client_Date='ClientA|20131101' and hour >= 900 and <= 1100
Storing the Atomic data
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 |
20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING |
59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• We must land all atomic data:
• Persistence
• Future replay (new metrics, corrections)
• Drill down capabilities/auditability
• The sparse nature of the FIX data fits the Cassandra data model very
well.
• We will store tags which are actually present in the data, saving space  a few
approaches depending on usage pattern.
Create table Trades_Skinny(
OrderID Text Primary_Key,
Date_ID int,
Ticker int,
Client text,
…Many more columns)
Create index ix_Date_ID on
Trade_Data_Skinny (Date_ID)
Create table Trades_Map(
OrderID Text Primary_Key,
Date_ID int,
Ticker int,
Client text,
Tags map <text, text>)
Create index ix_Date_ID on
Trade_Data_Map (Date_ID)
Create table Trades_Wide(
Order_ID Text,
Tag text,
Value text,
Primary key (Order_ID, Tag))
Closing Thought
• The days of staying committed to the discipline of a
single database technology – Relational – Are behind
us.
• Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data storage
technologies for different kinds of data. There will still
be large amounts of it managed in relational stores,
but increasingly we'll be first asking how we want to
manipulate the data and only then figuring out what
technology is the best bet for it.”
-- Martin Fowler
Recommended Reading http://lambda-architecture.net
Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
www.casertaconcepts.com

More Related Content

What's hot

Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and CassandraDataStax Academy
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterDataStax Academy
 
Big data 101 for beginners riga dev days
Big data 101 for beginners riga dev daysBig data 101 for beginners riga dev days
Big data 101 for beginners riga dev daysDuyhai Doan
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsOleg Magazov
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraFolio3 Software
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandraAaron Ploetz
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxData Con LA
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Kindling: Getting Started with Spark and Cassandra
Kindling: Getting Started with Spark and CassandraKindling: Getting Started with Spark and Cassandra
Kindling: Getting Started with Spark and CassandraDataStax Academy
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchDataStax Academy
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisDuyhai Doan
 

What's hot (20)

Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
 
Big data 101 for beginners riga dev days
Big data 101 for beginners riga dev daysBig data 101 for beginners riga dev days
Big data 101 for beginners riga dev days
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache Cassandra
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of Datastax
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Kindling: Getting Started with Spark and Cassandra
Kindling: Getting Started with Spark and CassandraKindling: Getting Started with Spark and Cassandra
Kindling: Getting Started with Spark and Cassandra
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
 

Viewers also liked

Status Quo on the automation support in SOA Suite OGhTech17
Status Quo on the automation support in SOA Suite OGhTech17Status Quo on the automation support in SOA Suite OGhTech17
Status Quo on the automation support in SOA Suite OGhTech17Jon Petter Hjulstad
 
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VA
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VACleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VA
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VAClearedJobs.Net
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015Zohar Elkayam
 
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...liela_stunda
 
VMs All the Way Down (BSides Delaware 2016)
VMs All the Way Down (BSides Delaware 2016)VMs All the Way Down (BSides Delaware 2016)
VMs All the Way Down (BSides Delaware 2016)John Hubbard
 
(SEC320) Leveraging the Power of AWS to Automate Security & Compliance
(SEC320) Leveraging the Power of AWS to Automate Security & Compliance(SEC320) Leveraging the Power of AWS to Automate Security & Compliance
(SEC320) Leveraging the Power of AWS to Automate Security & ComplianceAmazon Web Services
 
Global Azure Bootcamp - Azure OMS
Global Azure Bootcamp - Azure OMSGlobal Azure Bootcamp - Azure OMS
Global Azure Bootcamp - Azure OMSBruno Lopes
 
General physicians and the adf Heddle
General physicians and the adf HeddleGeneral physicians and the adf Heddle
General physicians and the adf HeddleLeishman Associates
 
Delivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerDelivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerData Con LA
 
Business model cavans nl-sep-2014
Business model cavans nl-sep-2014Business model cavans nl-sep-2014
Business model cavans nl-sep-2014RolandSyntens
 
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...SPS Paris
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Lucidworks
 
SRE Study Notes - CH2,3,4
SRE Study Notes - CH2,3,4SRE Study Notes - CH2,3,4
SRE Study Notes - CH2,3,4Rick Hwang
 

Viewers also liked (20)

Understanding big data
Understanding big dataUnderstanding big data
Understanding big data
 
Status Quo on the automation support in SOA Suite OGhTech17
Status Quo on the automation support in SOA Suite OGhTech17Status Quo on the automation support in SOA Suite OGhTech17
Status Quo on the automation support in SOA Suite OGhTech17
 
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VA
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VACleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VA
Cleared Job Fair Job Seeker Handbook June 15, 2017, Dulles, VA
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
 
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...
“Ūdens resursi. Saglabāsim ūdeni kopā!” Pasaules lielākā mācību stunda Daugav...
 
VMs All the Way Down (BSides Delaware 2016)
VMs All the Way Down (BSides Delaware 2016)VMs All the Way Down (BSides Delaware 2016)
VMs All the Way Down (BSides Delaware 2016)
 
Greach 2014 Sesamestreet Grails2 Workshop
Greach 2014 Sesamestreet Grails2 Workshop Greach 2014 Sesamestreet Grails2 Workshop
Greach 2014 Sesamestreet Grails2 Workshop
 
Bol.com
Bol.comBol.com
Bol.com
 
(SEC320) Leveraging the Power of AWS to Automate Security & Compliance
(SEC320) Leveraging the Power of AWS to Automate Security & Compliance(SEC320) Leveraging the Power of AWS to Automate Security & Compliance
(SEC320) Leveraging the Power of AWS to Automate Security & Compliance
 
Gastles PXL Hogeschool 2017
Gastles PXL Hogeschool 2017Gastles PXL Hogeschool 2017
Gastles PXL Hogeschool 2017
 
Water resources
Water resourcesWater resources
Water resources
 
Global Azure Bootcamp - Azure OMS
Global Azure Bootcamp - Azure OMSGlobal Azure Bootcamp - Azure OMS
Global Azure Bootcamp - Azure OMS
 
General physicians and the adf Heddle
General physicians and the adf HeddleGeneral physicians and the adf Heddle
General physicians and the adf Heddle
 
Delivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerDelivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea Ursaner
 
Voetsporen 38
Voetsporen 38Voetsporen 38
Voetsporen 38
 
Business model cavans nl-sep-2014
Business model cavans nl-sep-2014Business model cavans nl-sep-2014
Business model cavans nl-sep-2014
 
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...
I1 - Securing Office 365 and Microsoft Azure like a rockstar (or like a group...
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
 
SRE Study Notes - CH2,3,4
SRE Study Notes - CH2,3,4SRE Study Notes - CH2,3,4
SRE Study Notes - CH2,3,4
 
Oracle Cloud Café IoT 12-APR-2016
Oracle Cloud Café IoT 12-APR-2016Oracle Cloud Café IoT 12-APR-2016
Oracle Cloud Café IoT 12-APR-2016
 

Similar to Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra

Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Caserta
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsAbhishekKumarAgrahar2
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxPriyadarshini648418
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxAIMLSEMINARS
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
Big Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of LightBig Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of LightAmazon Web Services LATAM
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 

Similar to Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra (20)

Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in details
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
Big Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of LightBig Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of Light
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 

More from Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 

More from Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra

  • 1. Low-Latency Analytics with NoSQL Joe Caserta June 18, 2014 Storm / Cassandra
  • 2. Quick Intro - Joe Caserta Launched Big Data practice Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) Dedicated to Data Warehousing, Business Intelligence since 1996 Began consulting database programing and data modeling 25+ years hands-on experience building database solutions Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise Formalized Alliances / Partnerships – System Integrators Partnered with Big Data vendors Cloudera, HortonWorks, Datameer, more… Launched Training practice, teaching data concepts world-wide Laser focus on extending Data Warehouses with Big Data solutions 1986 2004 1996 2009 2001 2010 2013 Launched Big Data Warehousing Meetup in NYC – 950+ Members 2012 Established best practices for big data ecosystem implementation Listed as a Top 20 Data Analytics Consulting Companies - CIO Review
  • 3. Expertise & Offerings Strategic Roadmap / Assessment / Education / Implementation Data Warehousing/ ETL/Data Integration BI/Visualization/ Analytics Big Data Analytics
  • 4. Client Portfolio Finance. Healthcare & Insurance Retail/eCommerce & Manufacturing Education & Services
  • 5. Listed as one of the 20 Most Promising Data Analytics Consulting Companies CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones who are at the forefront of tackling the real analytics challenges. A distinguished panel comprising of CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the Final 20. Caserta Concepts
  • 6. Sales Marketing Finance ETL Ad-Hoc Query Horizontally Scalable Environment - Optimized for Analytics Big Data Cluster Canned Reporting Big Data Analytics NoSQL Databases ETL Ad-Hoc/Canned Reporting Traditional BI Mahout MapReduce Pig/Hive N1 N2 N4N3 N5 Hadoop Distributed File System (HDFS) Others… Why is Data Analytics so important? Data Science Enterprise Data Warehouse
  • 7. • Data is coming in so fast, how do we monitor it? • Real real-time analytics • Relevance engines, financial fraud sensors, early warning sensors • Dealing with sparse, incomplete, volatile, and highly manufactured data • Agile to adapt quickly to changing business • Wider breadth of datasets and sources in scope requires larger data repositories • Most of world’s data is unstructured, semi-structured or multi-structured • Data volume is growing so processes must be more reliant on programmatic administration • Less people/process dependence Volume Variety VelocityVeracity Challenges With Big Data
  • 8.
  • 9. What’s Important Today (according to Joe) Hadoop Distribution: Cloudera, MapR, Hortonworks, Pivotal-HD  Tools:  Mahout: Machine learning  Hive: Map data to structures and use SQL-like queries  Pig: Data transformation language for big data, from Yahoo  Storm: Real-time ETL NoSQL: Document: MongoDB, CouchDB Graph: Neo4j, Titan Key Value: Riak, Redis Columnar: Cassandra, Hbase  Languages: SQL, Python, SciPy, Java  Predictive Modeling: R, SAS, SPSS
  • 10. Why talk about Storm & Cassandra? ERP Finance Legacy ETL Data Analytics Horizontally Scalable Environment - Optimized for Analytics Big Data Cluster Data Science Big Data BI NoSQL Database ETL Ad-Hoc/Canned Reporting Traditional BI Mahout MapReduce Pig/Hive N1 N2 N4N3 N5 Hadoop Distributed File System (HDFS) Traditional EDW Storm
  • 11. High Volume Ingestion Project Overview • The equity trading arm of a large US bank needed to scale its infrastructure to enable the ability to process/parse trade data real-time and calculate aggregations/statistics ~ 1Million/second ~12 Billion messages/day ~240 Billon/month • The solution needed to map the raw data to a data model in memory or low latency (for real-time), while persisting mapped data to disk (for end of day). • The proposed solution also needed to handle ad-hoc data requests for data analytics.
  • 12. The Data • Primarily FIX messages: Financial Information Exchange  • Established in early 90's as a standard for trade data communication  widely used throughout the industry • Basically a delimited file of variable attribute-value pairs • Looks something like this: 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • A single trade can be comprised of 1000's of such messages, although typical trades have about a dozen
  • 13. Additional Requirements • Linearly scalable • Highly available  no single point of failure ,quick recovery • Quicker time to benefit • Processing guarantees  NO DATA IS LOST!
  • 14. Some Sample Analytic Use Cases • Sum(Notional volume) by Ticker: Daily, Hourly, Minute • Average trade latency (Execution TS – Order TS) • Wash Sales (sell within x seconds of last buy) for same Client/Ticker
  • 15. NoSQL Database (Cassandra) Messaging Messaging Aggregation Framework Storm Ingestand Computation External Data Real-time Dashboards Push Aggregates To MQ Stream transaction Detail Higher Latency Analytics And Day End Application Log Data High Volume Real-timeAnalytics - SolutionArchitecture
  • 16. A little deeper… Storm Cluster Sensor Data d3.js Analytics Hadoop Cluster Low Latency Analytics Atomic data Aggregates Event Monitors • The Kafka messaging system is used for ingestion • Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics • Redis is used as a reference data lookup cache • Real time analytics are produced from the aggregated data. • Higher latency ad-hoc analytics are done in Hadoop using Pig and Hive Kafka
  • 17. What is Storm • Distributed Event Processor • Real-time data ingestion and dissemination • In-Stream ETL • Reliably process unbounded streams of data • Storm is fast: Clocked it at over a million tuples per second per node • It is scalable, fault-tolerant, guarantees your data will be processed • Preferred technology for real-time big data processing by organizations worldwide: • Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By • Incubator: • http://wiki.apache.org/incubator/StormProposal
  • 18. Components of Storm • Spout – Collects data from upstream feeds and submits it for processing • Tuple – A collection of data that is passed within Storm • Bolt – Processes tuples (Transformations) • Stream – Identifies outputs from Spouts/Bolts • Storm usually outputs to a NoSQL database
  • 19. Why NoSQL? • Performance: • Relational databases have a lot of features, overhead that we don’t need in many cases. Although we will miss some… • Scalability: • Most relational databases scale vertically giving them limits to how large they can get. Federation and Sharding is an awkward manual process. • Agile • Sparse Data / Data with a lot of variation • Most NoSQL scale horizontally on commodity hardware
  • 20. • Column families are the equivalent to a table in a RDMS • Primary unit of storage is a column, they are stored contiguously Skinny Rows: Most like relational database. Except columns are optional and not stored if omitted: Wide Rows: Rows can be billions of columns wide, used for time series, relationships, secondary indexes: What is Cassandra?
  • 21. Deeper Dive: Cassandra as an Analytic Database • Based on a blend of Dynamo and BigTable • Distributed, master-less • Super fast writes  Can ingest lots of data! • Very fast reads Why did we choose it: • Data throughput requirements • High availability • Simple expansion • Interesting data models for time series data (more on this later)
  • 22. Design Practices • Cassandra does not support aggregation or joins  Data model must be tuned to usage • Denormalize your data (flatten your primary dimensional attributes into your fact) • Storing the same data redundantly is OK Might sound weird but we've been doing this all along in the traditional world modeling our data to make analytic queries simple!
  • 23. Wide rows are our friends • Cassandra composite columns are powerful for analytic models • Facilitate multi-dimensional analysis • A wide row table may have N number of rows, and a variable number of columns (millions of columns) • And now with CQL3 we have “unpacked” wide rows into named columns  Easy to work with! 20130101 20130102 20130103 20130104 20130104 20130105 … ClientA 10003 9493 43143 45553 54553 34343 … ClientB 45453 34313 54543 `23233 4233 34423 … ClientC 3323 35313 43123 54543 43433 4343 … … … … … … … .. …
  • 24. More about wide rows! • The left-most column is the ROW KEY • It is the mechanism by which the row is distributed across the Cassandra cluster… • Care must be taken to prevent hot spots: Dates for example are not generally good candidates because all load will go to given set of servers on a particular day! • Data can be filtered using equal and “in” clause • The top row is the COLUMN KEY • Their can be a variable number of columns • It is acceptable to have millions/ even billions of columns in a table • Columns keys are sorted and can accept a range query (greater than / less than) 20130101 20130102 20130103 20130104 20130104 20130105 … ClientA 10003 9493 43143 45553 54553 34343 … ClientB 45453 34313 54543 `23233 4233 34423 … ClientC 3323 35313 43123 54543 43433 4343 … … … … … … … .. … Create table Client_Daily_Summary ( Client text, Date_ID int, Trade_Count int, Primary key (Client, Date_ID))
  • 25. Traditional CassandraAnalytic Model If we wanted to track trade count by day, hour we could stream our ETL to two (or more) summary fact tables 0900 1000 1100 1200 1300 1400 ClientA|20131101 1000 949 4314 4555 5455 3434 ClientA|20131102 4545 3431 5454 2323 423 3442 ClientB|20131101 332 3531 4312 5454 4343 434 20130101 20130102 20130103 20130104 20130104 20130105 ClientA 10003 9493 43143 45553 54553 34343 ClientB 45453 34313 54543 `23233 4233 34423 ClientC 3323 35313 43123 54543 43433 4343 Sample analytic query: Give me daily trade counts for ClientA between Jan 1 and Jan 3: Select Date_ID, Trade_Count from Client_Hourly_Summary ` where Client='ClientA' and Date_ID>=20130101 and Date_ID <=20130103 Sample analytic query: Give me hourly trade counts for ClientA for Jan1 between 9 and 11 AM Select Hour, Trade_Count from Client_Hourly_Summary ` where Client_Date='ClientA|20131101' and hour >= 900 and <= 1100
  • 26. Storing the Atomic data 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | • We must land all atomic data: • Persistence • Future replay (new metrics, corrections) • Drill down capabilities/auditability • The sparse nature of the FIX data fits the Cassandra data model very well. • We will store tags which are actually present in the data, saving space  a few approaches depending on usage pattern. Create table Trades_Skinny( OrderID Text Primary_Key, Date_ID int, Ticker int, Client text, …Many more columns) Create index ix_Date_ID on Trade_Data_Skinny (Date_ID) Create table Trades_Map( OrderID Text Primary_Key, Date_ID int, Ticker int, Client text, Tags map <text, text>) Create index ix_Date_ID on Trade_Data_Map (Date_ID) Create table Trades_Wide( Order_ID Text, Tag text, Value text, Primary key (Order_ID, Tag))
  • 27. Closing Thought • The days of staying committed to the discipline of a single database technology – Relational – Are behind us. • Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
  • 29. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta www.casertaconcepts.com

Editor's Notes

  1. Alternative NoSQL: Hbase, Cassandra, Druid, VoltDB