Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. What is needed is a scalable Big Data infrastructure that processes and parses extremely high message volumes in real time and calculates aggregations and statistics. Banking trade data, where volumes can exceed billions of messages a day, is a perfect example.
Firms are fast approaching 'the wall' of relational database scalability. Rather than imposing relational structure on analytics data, they must map raw trade data to a data model at low latency, persist the mapped data to disk, and handle ad-hoc requests for data analytics.
Joe introduces NoSQL databases, describing how they can scale far beyond relational databases while maintaining performance, and shares a real-world case study detailing the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
2. Quick Intro - Joe Caserta
(Career timeline, 1986–2013)
• Began consulting, database programming and data modeling
• 25+ years hands-on experience building database solutions
• Dedicated to Data Warehousing and Business Intelligence since 1996
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
• Launched Big Data practice
• Partnered with Big Data vendors: Cloudera, Hortonworks, Datameer, more…
• Formalized alliances / partnerships with system integrators
• Launched Training practice, teaching data concepts worldwide
• Launched Big Data Warehousing Meetup in NYC – 950+ members
• Established best practices for big data ecosystem implementation
• Listed as one of the Top 20 Data Analytics Consulting Companies by CIO Review
• Laser focus on extending Data Warehouses with Big Data solutions
3. Expertise & Offerings
• Strategic Roadmap / Assessment / Education / Implementation
• Data Warehousing / ETL / Data Integration
• BI / Visualization / Analytics
• Big Data Analytics
5. Listed as one of the 20 Most Promising Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling the real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the final 20.
Caserta Concepts
6. Why is Data Analytics so important?
[Architecture diagram: source systems (Sales, Marketing, Finance, others…) feed, via ETL, both a traditional Enterprise Data Warehouse (ad-hoc query, canned reporting, traditional BI) and a horizontally scalable Big Data cluster optimized for analytics (HDFS across nodes N1–N5, MapReduce, Pig/Hive, Mahout, NoSQL databases) serving Big Data analytics and data science.]
7. Challenges With Big Data
Velocity
• Data is coming in so fast, how do we monitor it?
• Real real-time analytics
• Relevance engines, financial fraud sensors, early warning sensors
Veracity
• Dealing with sparse, incomplete, volatile, and highly manufactured data
• Agile enough to adapt quickly to changing business
Variety
• Wider breadth of datasets and sources in scope requires larger data repositories
• Most of the world's data is unstructured, semi-structured or multi-structured
Volume
• Data volume is growing, so processes must be more reliant on programmatic administration
• Less people/process dependence
9. What’s Important Today (according to Joe)
Hadoop Distribution: Cloudera, MapR, Hortonworks, Pivotal-HD
Tools:
Mahout: Machine learning
Hive: Map data to structures and use SQL-like queries
Pig: Data transformation language for big data, from Yahoo
Storm: Real-time ETL
NoSQL:
Document: MongoDB, CouchDB
Graph: Neo4j, Titan
Key Value: Riak, Redis
Columnar: Cassandra, HBase
Languages: SQL, Python, SciPy, Java
Predictive Modeling: R, SAS, SPSS
10. Why talk about Storm & Cassandra?
[Architecture diagram: the same environment as before, now with ERP, Finance and Legacy source systems feeding, via ETL, a traditional EDW (ad-hoc/canned reporting, traditional BI) and a horizontally scalable Big Data cluster optimized for analytics (HDFS across nodes N1–N5, MapReduce, Pig/Hive, Mahout, NoSQL database); Storm handles the streaming ETL into the cluster, which serves Big Data BI, data analytics and data science.]
11. High Volume Ingestion Project Overview
• The equity trading arm of a large US bank needed to scale its infrastructure to process/parse trade data in real time and calculate aggregations/statistics:
~1 million messages/second, ~12 billion messages/day, ~240 billion messages/month
• The solution needed to map the raw data to a data model in memory at low latency (for real-time), while persisting the mapped data to disk (for end of day).
• The proposed solution also needed to handle ad-hoc data requests for data analytics.
12. The Data
• Primarily FIX messages: Financial Information eXchange
• Established in the early '90s as a standard for trade data communication; widely used throughout the industry
• Basically a delimited file of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• A single trade can comprise thousands of such messages, although typical trades have about a dozen (a parsing sketch follows)
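A minimal sketch of parsing such a message into tag/value pairs. The pipe delimiter mirrors the human-readable sample above; real FIX messages separate fields with the SOH character (0x01), and the helper name is just for illustration.

# Minimal sketch: parse a pipe-delimited FIX-style message into a dict.
RAW = ("8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | "
       "52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | "
       "55=MSFT | 54=1 | 38=15 | 10=128 |")

def parse_fix(message, sep=" | "):
    """Return a {tag: value} dict for only the fields present in the message."""
    fields = {}
    for pair in message.split(sep):
        pair = pair.strip().rstrip("|").strip()
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        fields[tag] = value
    return fields

tags = parse_fix(RAW)
print(tags["55"], tags["49"])   # MSFT PHLX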
13. Additional Requirements
• Linearly scalable
• Highly available: no single point of failure, quick recovery
• Quick time to benefit
• Processing guarantees: NO DATA IS LOST!
14. Some Sample Analytic Use Cases
• Sum(Notional Volume) by Ticker: daily, hourly, by minute
• Average trade latency (Execution TS – Order TS)
• Wash sales (a sell within x seconds of the last buy) for the same Client/Ticker (see the sketch below)
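A rough illustration of the wash-sale check above. The event fields, the in-memory state, and the x-second window are assumptions made for the example.

# Minimal sketch: flag a sell that occurs within WINDOW_SECONDS of the
# last buy for the same (client, ticker).
WINDOW_SECONDS = 30          # the "x seconds" threshold is an assumption

last_buy = {}                # (client, ticker) -> timestamp of the last buy

def check_event(client, ticker, side, ts):
    """side is 'BUY' or 'SELL', ts is epoch seconds; returns True on a wash sale."""
    key = (client, ticker)
    if side == "BUY":
        last_buy[key] = ts
        return False
    prev = last_buy.get(key)
    return prev is not None and (ts - prev) <= WINDOW_SECONDS

print(check_event("ClientA", "MSFT", "BUY", 1000))    # False
print(check_event("ClientA", "MSFT", "SELL", 1020))   # True: 20s after the buy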
16. A little deeper…
[Architecture diagram: sensor/trade data flows through Kafka into a Storm cluster, which emits atomic data and aggregates feeding low-latency analytics (d3.js) and event monitors; a Hadoop cluster holds the data for higher-latency analysis.]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs the atomic data and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
• A sketch of the ingestion path follows
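A minimal sketch of that ingestion path using the kafka-python and redis-py client libraries. The topic name, broker address, and Redis key layout are assumptions for illustration; in the actual architecture this consumption and enrichment happens inside the Storm topology.

# Hedged sketch: consume raw FIX messages from Kafka and enrich them
# with reference data cached in Redis. Topic and key names are hypothetical.
from kafka import KafkaConsumer   # kafka-python
import redis                      # redis-py

consumer = KafkaConsumer("fix-messages",              # hypothetical topic
                         bootstrap_servers="localhost:9092")
ref_cache = redis.Redis(host="localhost", port=6379)

for record in consumer:
    message = record.value.decode("utf-8")
    tags = parse_fix(message)                          # parser sketched on slide 12
    # Hypothetical reference lookup: resolve the sender (tag 49) to a venue name.
    venue = ref_cache.get("venue:%s" % tags.get("49"))
    print(tags.get("55"), venue)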
17. What is Storm?
• Distributed event processor
• Real-time data ingestion and dissemination
• In-stream ETL
• Reliably processes unbounded streams of data
• Storm is fast: clocked at over a million tuples per second per node
• It is scalable and fault-tolerant, and it guarantees your data will be processed
• Preferred technology for real-time big data processing by organizations worldwide:
• Partial list at https://github.com/nathanmarz/storm/wiki/Powered-By
• Apache Incubator proposal:
• http://wiki.apache.org/incubator/StormProposal
18. Components of Storm
• Spout – Collects data from upstream feeds and submits it for processing
• Tuple – A collection of data that is passed within Storm
• Bolt – Processes tuples (transformations)
• Stream – Identifies outputs from spouts/bolts
• Storm usually outputs to a NoSQL database
• A conceptual sketch of the dataflow follows
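To make those roles concrete, a plain-Python analogy. This is not the Storm API (real spouts and bolts are classes wired into a topology); it only illustrates how spouts, tuples, bolts, and streams relate.

# Conceptual analogy only: a spout emits tuples, bolts transform them,
# and a stream is the sequence of tuples flowing between them.
def fix_spout(raw_messages):
    """'Spout': pull raw data from an upstream feed and emit tuples."""
    for raw in raw_messages:
        yield ("fix-message", raw)               # one tuple on the output stream

def parse_bolt(stream):
    """'Bolt': transform each tuple (here, FIX text -> tag dict)."""
    for _name, raw in stream:
        yield ("parsed-trade", parse_fix(raw))   # parser sketched on slide 12

def persist_bolt(stream):
    """'Bolt': terminal step, e.g. write to a NoSQL database."""
    for _name, trade in stream:
        print("would write to Cassandra:", trade.get("55"))

raw_feed = ["55=MSFT | 54=1 | 38=15 |", "55=IBM | 54=2 | 38=20 |"]
persist_bolt(parse_bolt(fix_spout(raw_feed)))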
19. Why NoSQL?
• Performance:
• Relational databases have a lot of features and overhead that we don’t need in many cases. Although we will miss some…
• Scalability:
• Most relational databases scale vertically, which limits how large they can get. Federation and sharding are an awkward, manual process.
• Agile
• Sparse data / data with a lot of variation
• Most NoSQL databases scale horizontally on commodity hardware
20. What is Cassandra?
• Column families are the equivalent of a table in an RDBMS
• The primary unit of storage is a column; columns are stored contiguously
• Skinny rows: most like a relational database, except columns are optional and are not stored if omitted
• Wide rows: rows can be billions of columns wide; used for time series, relationships, secondary indexes
21. Deeper Dive: Cassandra as an Analytic Database
• Based on a blend of Dynamo and BigTable
• Distributed, masterless
• Super-fast writes: can ingest lots of data!
• Very fast reads
Why did we choose it?
• Data throughput requirements
• High availability
• Simple expansion
• Interesting data models for time series data (more on this later)
22. Design Practices
• Cassandra does not support aggregation or joins, so the data model must be tuned to usage
• Denormalize your data (flatten your primary dimensional attributes into your fact)
• Storing the same data redundantly is OK
Might sound weird, but we've been doing this all along in the traditional world: modeling our data to make analytic queries simple!
23. Wide rows are our friends
• Cassandra composite columns are powerful for analytic models
• Facilitate multi-dimensional analysis
• A wide-row table may have N rows and a variable number of columns (millions of columns)
• And now with CQL3 we have “unpacked” wide rows into named columns: easy to work with!
            20130101  20130102  20130103  20130104  20130104  20130105  …
ClientA     10003     9493      43143     45553     54553     34343     …
ClientB     45453     34313     54543     23233     4233      34423     …
ClientC     3323      35313     43123     54543     43433     4343      …
…           …         …         …         …         …         …         …
24. More about wide rows!
• The left-most column is the ROW KEY
• It is the mechanism by which the row is distributed across the Cassandra cluster…
• Care must be taken to prevent hot spots: dates, for example, are not generally good candidates because all load will go to a given set of servers on a particular day!
• Data can be filtered using equals and “in” clauses
• The top row is the COLUMN KEY
• There can be a variable number of columns
• It is acceptable to have millions or even billions of columns in a table
• Column keys are sorted and can accept a range query (greater than / less than)
            20130101  20130102  20130103  20130104  20130104  20130105  …
ClientA     10003     9493      43143     45553     54553     34343     …
ClientB     45453     34313     54543     23233     4233      34423     …
ClientC     3323      35313     43123     54543     43433     4343      …
…           …         …         …         …         …         …         …
CREATE TABLE Client_Daily_Summary (
    Client text,
    Date_ID int,
    Trade_Count int,
    PRIMARY KEY (Client, Date_ID)
);
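A minimal sketch of querying that table with the DataStax Python driver, showing the range filter on the sorted clustering key. The contact point and the 'trades' keyspace name are assumptions for the example.

# Hedged sketch: range query on the sorted clustering key (Date_ID)
# within a single partition (Client). Keyspace name is hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("trades")       # hypothetical keyspace

rows = session.execute(
    "SELECT Date_ID, Trade_Count FROM Client_Daily_Summary "
    "WHERE Client = %s AND Date_ID >= %s AND Date_ID <= %s",
    ("ClientA", 20130101, 20130103))

for row in rows:
    print(row.date_id, row.trade_count)   # the driver returns lower-cased column names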
25. Traditional Cassandra Analytic Model
If we wanted to track trade counts by day and by hour, we could stream our ETL into two (or more) summary fact tables:
                    0900   1000   1100   1200   1300   1400
ClientA|20131101    1000   949    4314   4555   5455   3434
ClientA|20131102    4545   3431   5454   2323   423    3442
ClientB|20131101    332    3531   4312   5454   4343   434

            20130101  20130102  20130103  20130104  20130104  20130105
ClientA     10003     9493      43143     45553     54553     34343
ClientB     45453     34313     54543     23233     4233      34423
ClientC     3323      35313     43123     54543     43433     4343
Sample analytic query: give me daily trade counts for ClientA between Jan 1 and Jan 3:
SELECT Date_ID, Trade_Count FROM Client_Daily_Summary
WHERE Client = 'ClientA' AND Date_ID >= 20130101 AND Date_ID <= 20130103
Sample analytic query: give me hourly trade counts for ClientA for Jan 1 between 9 and 11 AM:
SELECT Hour, Trade_Count FROM Client_Hourly_Summary
WHERE Client_Date = 'ClientA|20131101' AND Hour >= 900 AND Hour <= 1100
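A rough sketch of what streaming the ETL into two summary fact tables could look like per trade event. This is hedged: it assumes Trade_Count is a Cassandra counter column, which supports in-place increments; with plain int columns, as in the DDL on the previous slide, the counts would instead be accumulated in the Storm bolt and written in batches.

# Hedged sketch: fan a single trade event out to the daily and hourly
# summaries. Assumes counter-style Trade_Count columns and the
# hypothetical 'trades' keyspace used in the earlier sketch.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("trades")

def record_trade(client, date_id, hour):
    session.execute(
        "UPDATE Client_Daily_Summary SET Trade_Count = Trade_Count + 1 "
        "WHERE Client = %s AND Date_ID = %s", (client, date_id))
    session.execute(
        "UPDATE Client_Hourly_Summary SET Trade_Count = Trade_Count + 1 "
        "WHERE Client_Date = %s AND Hour = %s",
        ("%s|%s" % (client, date_id), hour))

record_trade("ClientA", 20131101, 900)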
26. Storing the Atomic data
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 |
20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING |
59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• We must land all atomic data:
• Persistence
• Future replay (new metrics, corrections)
• Drill-down capabilities / auditability
• The sparse nature of the FIX data fits the Cassandra data model very well
• We store only the tags that are actually present in the data, saving space. A few approaches, depending on usage pattern (a loading sketch follows the DDL):
-- Approach 1: skinny rows
CREATE TABLE Trades_Skinny (
    OrderID text PRIMARY KEY,
    Date_ID int,
    Ticker text,
    Client text
    -- …many more columns
);
CREATE INDEX ix_Date_ID ON Trades_Skinny (Date_ID);

-- Approach 2: fixed columns plus a map of the remaining tags
CREATE TABLE Trades_Map (
    OrderID text PRIMARY KEY,
    Date_ID int,
    Ticker text,
    Client text,
    Tags map<text, text>
);
CREATE INDEX ix_Date_ID ON Trades_Map (Date_ID);

-- Approach 3: wide rows, one (tag, value) pair per clustered column
CREATE TABLE Trades_Wide (
    Order_ID text,
    Tag text,
    Value text,
    PRIMARY KEY (Order_ID, Tag)
);
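A minimal sketch of landing one parsed FIX message in the map-based variant. Hedged: the keyspace name and the choice of which tags become first-class columns are assumptions; tag 11 (ClOrdID) is used as the order ID and tag 56 stands in for the client, purely for illustration.

# Hedged sketch: store one parsed message in Trades_Map, keeping every
# tag that was actually present in the raw message inside the map column.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("trades")   # hypothetical keyspace

def store_trade(tags):
    session.execute(
        "INSERT INTO Trades_Map (OrderID, Date_ID, Ticker, Client, Tags) "
        "VALUES (%s, %s, %s, %s, %s)",
        (tags.get("11"),                      # ClOrdID as the order ID
         int(tags.get("52", "0")[:8]),        # yyyymmdd taken from SendingTime
         tags.get("55"),                      # Symbol
         tags.get("56"),                      # TargetCompID as a stand-in client
         tags))                               # full tag map, only tags present

store_trade(parse_fix(RAW))                   # parser and sample from slide 12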
27. Closing Thought
• The days of staying committed to the discipline of a single database technology (relational) are behind us.
• Polyglot Persistence: “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
-- Martin Fowler