SlideShare a Scribd company logo
1 of 42
Data Platform at Twitter
Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan | Twitter | @krishnansriram
Hadoop Innovation Summit, Feb 12, 2015
Who am I?
• Engineering Manager, Data Platform at Twitter
• Formerly:
• Tech Lead, Big Data Platform at Netflix
• Group Lead & Senior Researcher at the San
Diego Supercomputer Center
Twitter Scale (Feb 2015)
• More than 288M monthly
active users
• More than 500M unique
monthly logged out users
• 500M tweets per day
• 80% mobile users
• Thousands of advertisers
Twitter Data Platform
• Enables use of large-scale resources to perform data
analytics at scale
• trending tweets
• ad impressions
• most clicked, most followed, most retweeted
• platform health & statistics
• experimentation
Twitter Data Platform
• Enables use of large-scale resources to perform data
analytics at scale
• trending tweets
• ad impressions
• most clicked, most followed, most retweeted
• platform health & statistics
• experimentation
In real-time!
Use Case: Counting Impressions
Data Streams
(user_id, tweet, timestamp, hashtags, …)
(user_id, tweet, device_id, …)
event queue
Logs
Storm topology
Hadoop job
Real-time
results
Batch
results
Storm
• Streaming compute framework
• Jobs represented as
topologies
• Process tuples as they come
• Data comes in from Spouts
• Data is computed on in Bolts
http://storm.apache.org
Hadoop
http://www.glennklockwood.com/di/hadoop-overview.php
• Batch MapReduce library
• Uses the Hadoop
Distributed File System
(HDFS) to store and
process data
• Jobs consist of Map &
Reduce phases
• Synchronization barriers
between each stage
Challenges
• Hadoop and Storm present two different
programming models
• Written in different languages
• Each job has to be written twice
• Hard to optimize, specific for each job & platform
• Hard to compute complete up-to-the-moment
information
Goals
• Write job once
Goals
• Write job once!
• Portable
Goals
• Write job once!
• Portable!
• Fault tolerant
Goals
• Write job once!
• Portable!
• Fault tolerant!
• Up-to-the-moment
TSAR
• The TimeSeries AggregatoR
• https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
We are still counting impressions
aggregate {
onKeys (
(TweetId)
) produce (
Count
) sinkTo (
Vertica
)
} fromProducer {
!
ClientEventSource(“client_events”)
.filter { e => isImpressionEvent(e) }
.map { e =>
val impr = Impression(e.tweetId)
(e.timestamp, impr)
}
}
The TSAR job
Dimensions
Metrics
Data Sinks
Data Sources
TSAR will then:
• Configure and launch jobs on Storm & Hadoop
• Create services for querying results
• Create tables and views and staging jobs
• Create alerts and observability graphs
Behind the scenes
Scalding Storm
Batch (Hadoop)!
Better accuracy!
Worse latency
Realtime!
Better latency!
Worse accuracy
TSAR
Behind the scenes
Scalding Storm
Batch (Hadoop)!
Better accuracy!
Worse latency
Realtime!
Better latency!
Worse accuracy
TSAR Vertica
Manual Data Exploration
Manhattan
Dashboards & !
Production Services
Behind the scenes
Summingbird
Scalding Storm
Batch (Hadoop)!
Better accuracy!
Worse latency
Realtime!
Better latency!
Worse accuracy
TSAR Vertica
Manual Data Exploration
Manhattan
Dashboards & !
Production Services
Glossary
• Summingbird - framework for integrating batch
and online MapReduce computations
• Scalding - Scala library for running batch
MapReduce jobs
• Manhattan - real-time multi-tenant distributed
database
What is Summingbird?
1) Model for streaming
multi-stage map-reduce
What is streaming map-reduce?
Service
Merge
SumByKey
Map
Map
Lookup
Source Source
!
Can push one
tuple through at a
time to update
state
!
=> can work on
batch and real-
time streams of
dataStore
One-at-a-time semantics, run
the job in realtime or in batch
What is Summingbird?
2) Implementations to run
on Storm, Scalding
(Hadoop), Spark, etc
What is Summingbird?
2) Implementations to run
on Storm, Scalding
(Hadoop), Spark, etc
Portable
Optimizers at the
Summingbird layer,
leverage those optimizers
across platforms
What is Summingbird?
3) Systematic
implementation of the
“Lambda Architecture”
summingbird-scalding
summingbird-storm
storehaus-memcache
storehaus-algebra
storehaus-manhattan
Kafka
Lambda Architecture with Summingbird
http://lambda-architecture.net
What is Summingbird?
3) Systematic
implementation of the
“Lambda Architecture”
Fault Tolerant
What is Summingbird?
3) Systematic
implementation of the
“Lambda Architecture”
Fault Tolerant
Up-to-the-moment
Restricting reduce operators
to a very general class
(semigroups, monoids)system.
Monoid
2 + 3 = 61 +
Monoid
2 + 3 = 61 +
=
3
Monoid
2 + 3 = 61 +
=
5
Example Monoids
• (a min b) min c = a min (b min c)
• (a max b) max c = a max (b max c)
• addition: (a + b) + c = a + (b + c)
• set union: (a u b) u c = a u (b u c)
• set intersection: (a n b) n c = a n (b n c)
• approximate unique count (HLL), approximate
counter (CMS)
!
!
Algebird - https://github.com/twitter/algebird
Batching and associativity yields reliability
Batch0 Batch1 Batch2 Batch3
Slow but
fault
tolerant
Noisy
but fast
Realtime
sums from
0, each
batch
Log
Hadoop Hadoop Hadoop Hadoop
Log Log Log
RT RT RT RTRT RT RT RT
Hadoop
keeps a total
sum (reliably)
Batching and associativity yields reliability
Batch0 Batch1 Batch2 Batch3
Log
Hadoop Hadoop Hadoop Hadoop
Log Log Log
RT RT RT RTRT RT RT RT Sum of RT
Batch(i) +
Hadoop
Batch(i-1)
has bounded
noise, bounded
read/write size.
!
Done at query
time
All of that sums up to
Summary
• Twitter Data Platform enables use of large-scale
resources to perform data analytics at scale
• Write jobs once, that are
• Portable, reliable, and up-to-the-moment
• A systematic implementation of the lambda
architecture
• Monoids & semi-groups for parallelism & performance
• Batching & associativity for reliability
Thank you!
@krishnansriram
Acknowledgements!
Ekaterina Gonina
Ian O’Connell
Gabriel Gonzalez
Oscar Boykin
!
Twitter Data Platform Team!
We are hiring!!
!
@summingbird
https://github.com/twitter/summingbird
@scalding
https://github.com/twitter/scalding

More Related Content

What's hot

Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 

What's hot (20)

Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
ebay
ebayebay
ebay
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 

Viewers also liked

Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Sriram Krishnan
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
Springwise Top Digital Innovations 2015
Springwise Top Digital Innovations 2015Springwise Top Digital Innovations 2015
Springwise Top Digital Innovations 2015Springwise
 
Hourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopMatthew Hayes
 
BigData Overview
BigData OverviewBigData Overview
BigData OverviewHoryun Lee
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015Kurt Brown
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)Eva Tse
 
Data Science with Elastic MapReduce (EMR) at Netflix
Data Science with Elastic MapReduce (EMR) at NetflixData Science with Elastic MapReduce (EMR) at Netflix
Data Science with Elastic MapReduce (EMR) at NetflixKurt Brown
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute modelDean Wampler
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineeringnathanmarz
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 

Viewers also liked (20)

Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Springwise Top Digital Innovations 2015
Springwise Top Digital Innovations 2015Springwise Top Digital Innovations 2015
Springwise Top Digital Innovations 2015
 
Hourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on Hadoop
 
BigData Overview
BigData OverviewBigData Overview
BigData Overview
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
Data Science with Elastic MapReduce (EMR) at Netflix
Data Science with Elastic MapReduce (EMR) at NetflixData Science with Elastic MapReduce (EMR) at Netflix
Data Science with Elastic MapReduce (EMR) at Netflix
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Logging & Metrics
Logging & Metrics  Logging & Metrics
Logging & Metrics
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 

Similar to Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale

TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysSmartNews, Inc.
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaOCoderFest
 
Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of ThingsSujee Maniyam
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Thingselephantscale
 
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...In-Memory Computing Summit
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Stratio
 
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...Amazon Web Services
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Eric Sammer
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software ValidationBioDec
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
The Incremental Path to Observability
The Incremental Path to ObservabilityThe Incremental Path to Observability
The Incremental Path to ObservabilityEmily Nakashima
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsthelabdude
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer confluent
 
Getting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsGetting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsAmazon Web Services
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 

Similar to Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale (20)

Tsar tech talk
Tsar tech talkTsar tech talk
Tsar tech talk
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in Grafana
 
Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of Things
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Things
 
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
 
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software Validation
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
The Incremental Path to Observability
The Incremental Path to ObservabilityThe Incremental Path to Observability
The Incremental Path to Observability
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Getting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsGetting Started with Real-Time Analytics
Getting Started with Real-Time Analytics
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 

Recently uploaded

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Recently uploaded (20)

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale

  • 1. Data Platform at Twitter Enabling Real-time & Batch Analytics at Scale Sriram Krishnan | Twitter | @krishnansriram Hadoop Innovation Summit, Feb 12, 2015
  • 2. Who am I? • Engineering Manager, Data Platform at Twitter • Formerly: • Tech Lead, Big Data Platform at Netflix • Group Lead & Senior Researcher at the San Diego Supercomputer Center
  • 3. Twitter Scale (Feb 2015) • More than 288M monthly active users • More than 500M unique monthly logged out users • 500M tweets per day • 80% mobile users • Thousands of advertisers
  • 4. Twitter Data Platform • Enables use of large-scale resources to perform data analytics at scale • trending tweets • ad impressions • most clicked, most followed, most retweeted • platform health & statistics • experimentation
  • 5. Twitter Data Platform • Enables use of large-scale resources to perform data analytics at scale • trending tweets • ad impressions • most clicked, most followed, most retweeted • platform health & statistics • experimentation In real-time!
  • 6. Use Case: Counting Impressions
  • 7. Data Streams (user_id, tweet, timestamp, hashtags, …) (user_id, tweet, device_id, …) event queue Logs Storm topology Hadoop job Real-time results Batch results
  • 8. Storm • Streaming compute framework • Jobs represented as topologies • Process tuples as they come • Data comes in from Spouts • Data is computed on in Bolts http://storm.apache.org
  • 9. Hadoop http://www.glennklockwood.com/di/hadoop-overview.php • Batch MapReduce library • Uses the Hadoop Distributed File System (HDFS) to store and process data • Jobs consist of Map & Reduce phases • Synchronization barriers between each stage
  • 10. Challenges • Hadoop and Storm present two different programming models • Written in different languages • Each job has to be written twice • Hard to optimize, specific for each job & platform • Hard to compute complete up-to-the-moment information
  • 12. Goals • Write job once! • Portable
  • 13. Goals • Write job once! • Portable! • Fault tolerant
  • 14. Goals • Write job once! • Portable! • Fault tolerant! • Up-to-the-moment
  • 15. TSAR • The TimeSeries AggregatoR • https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
  • 16. We are still counting impressions
  • 17. aggregate { onKeys ( (TweetId) ) produce ( Count ) sinkTo ( Vertica ) } fromProducer { ! ClientEventSource(“client_events”) .filter { e => isImpressionEvent(e) } .map { e => val impr = Impression(e.tweetId) (e.timestamp, impr) } } The TSAR job Dimensions Metrics Data Sinks Data Sources
  • 18. TSAR will then: • Configure and launch jobs on Storm & Hadoop • Create services for querying results • Create tables and views and staging jobs • Create alerts and observability graphs
  • 19. Behind the scenes Scalding Storm Batch (Hadoop)! Better accuracy! Worse latency Realtime! Better latency! Worse accuracy TSAR
  • 20. Behind the scenes Scalding Storm Batch (Hadoop)! Better accuracy! Worse latency Realtime! Better latency! Worse accuracy TSAR Vertica Manual Data Exploration Manhattan Dashboards & ! Production Services
  • 21. Behind the scenes Summingbird Scalding Storm Batch (Hadoop)! Better accuracy! Worse latency Realtime! Better latency! Worse accuracy TSAR Vertica Manual Data Exploration Manhattan Dashboards & ! Production Services
  • 22. Glossary • Summingbird - framework for integrating batch and online MapReduce computations • Scalding - Scala library for running batch MapReduce jobs • Manhattan - real-time multi-tenant distributed database
  • 23. What is Summingbird? 1) Model for streaming multi-stage map-reduce
  • 24. What is streaming map-reduce? Service Merge SumByKey Map Map Lookup Source Source ! Can push one tuple through at a time to update state ! => can work on batch and real- time streams of dataStore
  • 25. One-at-a-time semantics, run the job in realtime or in batch
  • 26. What is Summingbird? 2) Implementations to run on Storm, Scalding (Hadoop), Spark, etc
  • 27. What is Summingbird? 2) Implementations to run on Storm, Scalding (Hadoop), Spark, etc Portable
  • 28. Optimizers at the Summingbird layer, leverage those optimizers across platforms
  • 29. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture”
  • 31. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture” Fault Tolerant
  • 32. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture” Fault Tolerant Up-to-the-moment
  • 33. Restricting reduce operators to a very general class (semigroups, monoids)system.
  • 34. Monoid 2 + 3 = 61 +
  • 35. Monoid 2 + 3 = 61 + = 3
  • 36. Monoid 2 + 3 = 61 + = 5
  • 37. Example Monoids • (a min b) min c = a min (b min c) • (a max b) max c = a max (b max c) • addition: (a + b) + c = a + (b + c) • set union: (a u b) u c = a u (b u c) • set intersection: (a n b) n c = a n (b n c) • approximate unique count (HLL), approximate counter (CMS) ! ! Algebird - https://github.com/twitter/algebird
  • 38. Batching and associativity yields reliability Batch0 Batch1 Batch2 Batch3 Slow but fault tolerant Noisy but fast Realtime sums from 0, each batch Log Hadoop Hadoop Hadoop Hadoop Log Log Log RT RT RT RTRT RT RT RT Hadoop keeps a total sum (reliably)
  • 39. Batching and associativity yields reliability Batch0 Batch1 Batch2 Batch3 Log Hadoop Hadoop Hadoop Hadoop Log Log Log RT RT RT RTRT RT RT RT Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size. ! Done at query time
  • 40. All of that sums up to
  • 41. Summary • Twitter Data Platform enables use of large-scale resources to perform data analytics at scale • Write jobs once, that are • Portable, reliable, and up-to-the-moment • A systematic implementation of the lambda architecture • Monoids & semi-groups for parallelism & performance • Batching & associativity for reliability
  • 42. Thank you! @krishnansriram Acknowledgements! Ekaterina Gonina Ian O’Connell Gabriel Gonzalez Oscar Boykin ! Twitter Data Platform Team! We are hiring!! ! @summingbird https://github.com/twitter/summingbird @scalding https://github.com/twitter/scalding