SlideShare a Scribd company logo
1 of 46
Building
reactive real-time
data pipeline at FPT
by @tantrieuf31
Basic info
◦ I work at FPT Telecom, tech lead of the team
FA2
(Rich Media Analytics and Advertising)
◦ My focus about data infrastructure, backend
system and reactive system architecture.
◦ My blog at nguyentantrieu.info/blog
◦ Founding a small Data Lab at mc2ads.org and
has started R&D about Fast Data since 2014
Note:
◦ All contents and thoughts in this slide are my
subjective ideas and work experience.
◦ The slide is written for meetup event at
grokkingengineering.org
Contents
1. What is “Data Pipeline” ?
2. Big Data Problems at FPT
a. VnExpress: pageview and heat-map
b. eClick: real-time reactive advertising
3. Solutions and Patterns
4. Fast Data Architecture at FPT
5. Wrap up
“
1) What is “Data Pipeline” ?
“Intelligence is not only the ability to
reason; it is also the ability to find
relevant material in memory and to
deploy attention when needed.”
― Daniel Kahneman, Thinking, Fast and Slow
Data Pipeline
a set of data processing elements
connected in series, where the output
of one element is the input of the next
one
“
2) Big Data Problems at FPT
2.1 VnExpress: scaling Analytics system
2.2 eClick: real-time data pipeline
50 GB Log/day for real-time processing
300 GB Log/day for batch processing
processing nearly 5000 events in 1 second !
less than 5 seconds for critical KPI
is the max latency from tracking to reacting !
20 millions Vietnamese online user
that’s the number of active Vietnamese user
from around the World
about Big Data System at FPT Online
Problems at VnExpress
◦ Counting page-view for all articles
◦ Predicting the reach number for a
specific topic, keyword, URL.
◦ Computing the social trending from
Facebook
◦ Classifying positive/negative
comments
◦ Viral score for journal topics
◦ Smarter content editor
◦ Real-time newsroom
Problem: Which topics most commented on
Problem: Measure the heatmap of User Activity
The architecture to Social Media Analytics
Challenges at eClick’s backend system
◦ Should be Faster
◦ Must be Scalable
◦ Must be “anti-fraud”
◦ Should be Maintainable
◦ More Agile in Data Analytics
◦ Reactive for important events in real-
time
◦ Connecting the Seller Side and Buyer
Side better
at eClick we have
~50 GB Log for Stream
~300 GB Log for Batch
just for tracking user
actions (click,
impression,...)
in ONE day !
at eClick we must
check campaigns in
near-real-time
(seconds) !
at eClick we have many types of log
(video, web, mobile, ad-campaigns, articles, … )
How problem was solved
in 2012, that’s all I have ! And I chose the Storm
Finding a framework
Learn & Build the system in agile way
“Reacting to change quickly”
Before implementing Lambda Architecture at FPT
That’s the Batch Processing Architecture
Before implementing Lambda Architecture
at FPT (Before May 1, 2013 )
Aggregating Log Data
(Scribe)
File System
Statistics with SQL
(MySQL)
Reporting RDBMS
(Oracle)
Problems with Batch Processing Architecture
◦ Checking advertising campaign takes
more than 5 minutes
◦ Hard to scale for 20 million users
◦ Not fault tolerant (by File System)
◦ More than 15 minutes for reporting
(File → MySQL + Python → Oracle)
Lambda Architecture 1.0 at FPT
Key idea: Adapting Apache Kafka
for distributed messaging and event queue
http://blog.confluent.io/2015/02/25/stream-data-platform-1/
Key idea: data as streams
“Lambda architecture 1.0”
at FPT Online
from May 1, 2013
to April 30, 2015
Source: http://www.manning.com/marz/
Processing Topology at eClick
Kafka Cluster
topic: impression
Tokenizing
processor
Parsing
Processor
Checking
campaign
processor
Statistics
Processor
Storing Data
Warehouse
Kafka
Consumer
Campaign
database
Real-time
Statistics
database
Data
Warehouse
Kafka Cluster
topic: advertising-data
Optimize
advertising
functor
Improvements after applying
Lambda Architecture version 1
◦ Checking advertising campaign in
less than 5 seconds (near real-time)
◦ Scalability for 20 million users
◦ Fault tolerance (thank to Kafka)
◦ Take less than 3 minutes for
reporting (Kafka → Redis → Oracle)
◦ Take less than 1 hour for Data Mining
Jobs (with HBase and Java code)
Lambda Architecture 2.0 at FPT
Make everything as simple as possible, but not simpler.
In version 2, we want to react to the World faster
Why is λ 2.0?
Because λ 1.0 does not deliver
“ Fast Personalized UX” for Agile Business
→ Reactive Advertising for Digital Media
→
User Activity Heatmap
Improvements in Lambda Architecture version 2
◦ Fast Processing for Data Mining / Learning
◦ Fast Agile for Data Science
◦ Fast Reactive for critical event in real-time
◦ Fast Connecting from demand to supply
◦ Fast Sharing information among the “Actor”
“Lambda Architecture 2.0”
will be “In Production” after June 1, 2015
Problem: tracking, crawling and measuring social
trends and user behaviors from social media
“
4) Fast Data Architecture at FPT
Built with Lambda Architecture 2.0
reacting to
critical event
faster
for data mining
and machine
learning on Big
Data
for observable
data
SQL querying (SQL
is true lambda
language !?)
We use new CONCEPTS for abstracting
all things in system
Data Actor
Is the source for all data
Data Pipeline
just special data queue
Data Event
is what’s happened in
specific context, emitted
by data actor
Processing
Topology
is a directed graph of
event processors
Event processor
is the logical processor
for data pipeline
Reactive functor
is the is a functional actor
that responds (reacts) to
external data events
The architecture for Fast Data:
Reactive real-time data pipeline
1) Data collector (I/O networking)
● Netty for event collector and HTTP server (rfx-track)
2) Data persistence (aka: data storage)
● Kafka for distributed message queue(Apache Kafka)
● HBase for scalable big table
3) Data processing
● RFX with stream processor (RFX framework)
● Apache Spark for in-memory batch processing
● RxJava + Akka for reactive processor (reactivex)
4) Data analysis
● measures of uncertainty
○ Python Dempster-Shafer theory → anti-fraud
5) Data reporting
● SQL with Phoenix (http://phoenix.apache.org )
● real-time frontend report: NodeJs, SocketIO
Technology stack ( 5D model )
“
5) What we have learned
1) Understanding problems
2) Pick right framework
3) Build
4) Test
5) Measure
6) Learn
7) Go back to step 3
It’s time to solve practical problem “Lambda for Vacuum Cleaner”
1) push the “Hello” message into vacuum cleaner
2) take it out !
3) display message “HELLO” in readable and ordered
format?
Question 1: Which data structure could be used to
implement these tasks ?
Question 2: Can you solve it using a functional
programming language?
Question 3: Write code to prove your solution
Solution code for problem “Lambda for Vacuum Cleaner”
“In a concurrent world, imperative is the wrong default !”
– Tim Sweeney, Epic Games
Hey, back-end engineers and
front-end engineers
Want to join team FA2
at FPT?
Please submit your CV to
tantrieuf31@gmail.com

More Related Content

What's hot

Polyglot Processing - An Introduction 1.0
Polyglot Processing - An Introduction 1.0 Polyglot Processing - An Introduction 1.0
Polyglot Processing - An Introduction 1.0
Dr. Mohan K. Bavirisetty
 

What's hot (20)

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Polyglot Processing - An Introduction 1.0
Polyglot Processing - An Introduction 1.0 Polyglot Processing - An Introduction 1.0
Polyglot Processing - An Introduction 1.0
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
 
Big Data Monitoring Cockpit
Big Data Monitoring CockpitBig Data Monitoring Cockpit
Big Data Monitoring Cockpit
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
 
Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the Cloud
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
Natalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platformNatalie Godec - AirFlow and GCP: tomorrow's health service data platform
Natalie Godec - AirFlow and GCP: tomorrow's health service data platform
 

Viewers also liked

Effects of hashtag in Instagram Marketing
Effects of hashtag in Instagram MarketingEffects of hashtag in Instagram Marketing
Effects of hashtag in Instagram Marketing
Hoà Bùi
 
Admicro PR Solution 2014
Admicro PR Solution 2014Admicro PR Solution 2014
Admicro PR Solution 2014
Nguyen Thanh
 

Viewers also liked (20)

Marketing and opera
Marketing and operaMarketing and opera
Marketing and opera
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Introduction to RFX for Backend Developer
Introduction to RFX for Backend DeveloperIntroduction to RFX for Backend Developer
Introduction to RFX for Backend Developer
 
Big data infrastructure todo-tasks Rfx Framework
Big data infrastructure todo-tasks Rfx FrameworkBig data infrastructure todo-tasks Rfx Framework
Big data infrastructure todo-tasks Rfx Framework
 
From Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data IntelligenceFrom Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data Intelligence
 
Using User Behavior for Real-time Advertising
Using User Behavior for Real-time AdvertisingUsing User Behavior for Real-time Advertising
Using User Behavior for Real-time Advertising
 
Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
Giới thiệu cơ bản về Big Data và các ứng dụng thực tiễn
Giới thiệu cơ bản về Big Data và các ứng dụng thực tiễnGiới thiệu cơ bản về Big Data và các ứng dụng thực tiễn
Giới thiệu cơ bản về Big Data và các ứng dụng thực tiễn
 
Social Media power on Shoppers' decision journey
Social Media power on Shoppers' decision journeySocial Media power on Shoppers' decision journey
Social Media power on Shoppers' decision journey
 
Content Marketing - Make an Impression
Content Marketing - Make an ImpressionContent Marketing - Make an Impression
Content Marketing - Make an Impression
 
[Materials];[Admicro tang hieu_suat_quang_cao]
[Materials];[Admicro tang hieu_suat_quang_cao][Materials];[Admicro tang hieu_suat_quang_cao]
[Materials];[Admicro tang hieu_suat_quang_cao]
 
Frnog26
Frnog26Frnog26
Frnog26
 
Effects of hashtag in Instagram Marketing
Effects of hashtag in Instagram MarketingEffects of hashtag in Instagram Marketing
Effects of hashtag in Instagram Marketing
 
Admicro mobileads profile 0946.251.335
Admicro mobileads profile 0946.251.335Admicro mobileads profile 0946.251.335
Admicro mobileads profile 0946.251.335
 
FTI Intro
FTI IntroFTI Intro
FTI Intro
 
Remote 2 android - Poly sáng tạo 2016 - Sinh viên FPT Polytechnic
Remote 2 android - Poly sáng tạo 2016 - Sinh viên FPT PolytechnicRemote 2 android - Poly sáng tạo 2016 - Sinh viên FPT Polytechnic
Remote 2 android - Poly sáng tạo 2016 - Sinh viên FPT Polytechnic
 
Profile Admicro - English
Profile Admicro - EnglishProfile Admicro - English
Profile Admicro - English
 
Admicro PR Solution 2014
Admicro PR Solution 2014Admicro PR Solution 2014
Admicro PR Solution 2014
 
Cẩm nang content marketing
Cẩm nang content marketingCẩm nang content marketing
Cẩm nang content marketing
 

Similar to Building Reactive Real-time Data Pipeline

Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
confluent
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 

Similar to Building Reactive Real-time Data Pipeline (20)

Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...Open Blueprint for Real-Time  Analytics in Retail: Strata Hadoop World 2017 S...
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 

More from Trieu Nguyen

[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP
Trieu Nguyen
 

More from Trieu Nguyen (20)

Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdfBuilding Your Customer Data Platform with LEO CDP in Travel Industry.pdf
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf
 
Building Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel BusinessBuilding Your Customer Data Platform with LEO CDP - Spa and Hotel Business
Building Your Customer Data Platform with LEO CDP - Spa and Hotel Business
 
Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP Building Your Customer Data Platform with LEO CDP
Building Your Customer Data Platform with LEO CDP
 
How to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPHow to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDP
 
[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP[Notes] Customer 360 Analytics with LEO CDP
[Notes] Customer 360 Analytics with LEO CDP
 
Leo CDP - Pitch Deck
Leo CDP - Pitch DeckLeo CDP - Pitch Deck
Leo CDP - Pitch Deck
 
LEO CDP - What's new in 2022
LEO CDP  - What's new in 2022LEO CDP  - What's new in 2022
LEO CDP - What's new in 2022
 
Lộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sảnLộ trình triển khai LEO CDP cho ngành bất động sản
Lộ trình triển khai LEO CDP cho ngành bất động sản
 
Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?Why is LEO CDP important for digital business ?
Why is LEO CDP important for digital business ?
 
From Dataism to Customer Data Platform
From Dataism to Customer Data PlatformFrom Dataism to Customer Data Platform
From Dataism to Customer Data Platform
 
Data collection, processing & organization with USPA framework
Data collection, processing & organization with USPA frameworkData collection, processing & organization with USPA framework
Data collection, processing & organization with USPA framework
 
Part 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technologyPart 1: Introduction to digital marketing technology
Part 1: Introduction to digital marketing technology
 
Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?Why is Customer Data Platform (CDP) ?
Why is Customer Data Platform (CDP) ?
 
How to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation PlatformHow to build a Personalized News Recommendation Platform
How to build a Personalized News Recommendation Platform
 
How to grow your business in the age of digital marketing 4.0
How to grow your business  in the age of digital marketing 4.0How to grow your business  in the age of digital marketing 4.0
How to grow your business in the age of digital marketing 4.0
 
Video Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big dataVideo Ecosystem and some ideas about video big data
Video Ecosystem and some ideas about video big data
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Open OTT - Video Content Platform
Open OTT - Video Content PlatformOpen OTT - Video Content Platform
Open OTT - Video Content Platform
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)Introduction to Recommendation Systems (Vietnam Web Submit)
Introduction to Recommendation Systems (Vietnam Web Submit)
 

Building Reactive Real-time Data Pipeline

  • 2. Basic info ◦ I work at FPT Telecom, tech lead of the team FA2 (Rich Media Analytics and Advertising) ◦ My focus about data infrastructure, backend system and reactive system architecture. ◦ My blog at nguyentantrieu.info/blog ◦ Founding a small Data Lab at mc2ads.org and has started R&D about Fast Data since 2014 Note: ◦ All contents and thoughts in this slide are my subjective ideas and work experience. ◦ The slide is written for meetup event at grokkingengineering.org
  • 3. Contents 1. What is “Data Pipeline” ? 2. Big Data Problems at FPT a. VnExpress: pageview and heat-map b. eClick: real-time reactive advertising 3. Solutions and Patterns 4. Fast Data Architecture at FPT 5. Wrap up
  • 4. “ 1) What is “Data Pipeline” ? “Intelligence is not only the ability to reason; it is also the ability to find relevant material in memory and to deploy attention when needed.” ― Daniel Kahneman, Thinking, Fast and Slow
  • 5. Data Pipeline a set of data processing elements connected in series, where the output of one element is the input of the next one
  • 6. “ 2) Big Data Problems at FPT 2.1 VnExpress: scaling Analytics system 2.2 eClick: real-time data pipeline
  • 7. 50 GB Log/day for real-time processing 300 GB Log/day for batch processing processing nearly 5000 events in 1 second ! less than 5 seconds for critical KPI is the max latency from tracking to reacting ! 20 millions Vietnamese online user that’s the number of active Vietnamese user from around the World about Big Data System at FPT Online
  • 8. Problems at VnExpress ◦ Counting page-view for all articles ◦ Predicting the reach number for a specific topic, keyword, URL. ◦ Computing the social trending from Facebook ◦ Classifying positive/negative comments ◦ Viral score for journal topics ◦ Smarter content editor ◦ Real-time newsroom
  • 9. Problem: Which topics most commented on
  • 10. Problem: Measure the heatmap of User Activity
  • 11. The architecture to Social Media Analytics
  • 12. Challenges at eClick’s backend system ◦ Should be Faster ◦ Must be Scalable ◦ Must be “anti-fraud” ◦ Should be Maintainable ◦ More Agile in Data Analytics ◦ Reactive for important events in real- time ◦ Connecting the Seller Side and Buyer Side better
  • 13. at eClick we have ~50 GB Log for Stream ~300 GB Log for Batch just for tracking user actions (click, impression,...) in ONE day ! at eClick we must check campaigns in near-real-time (seconds) ! at eClick we have many types of log (video, web, mobile, ad-campaigns, articles, … )
  • 14. How problem was solved
  • 15. in 2012, that’s all I have ! And I chose the Storm Finding a framework
  • 16. Learn & Build the system in agile way “Reacting to change quickly”
  • 17. Before implementing Lambda Architecture at FPT That’s the Batch Processing Architecture
  • 18. Before implementing Lambda Architecture at FPT (Before May 1, 2013 ) Aggregating Log Data (Scribe) File System Statistics with SQL (MySQL) Reporting RDBMS (Oracle)
  • 19. Problems with Batch Processing Architecture ◦ Checking advertising campaign takes more than 5 minutes ◦ Hard to scale for 20 million users ◦ Not fault tolerant (by File System) ◦ More than 15 minutes for reporting (File → MySQL + Python → Oracle)
  • 21. Key idea: Adapting Apache Kafka for distributed messaging and event queue
  • 23. “Lambda architecture 1.0” at FPT Online from May 1, 2013 to April 30, 2015 Source: http://www.manning.com/marz/
  • 24. Processing Topology at eClick Kafka Cluster topic: impression Tokenizing processor Parsing Processor Checking campaign processor Statistics Processor Storing Data Warehouse Kafka Consumer Campaign database Real-time Statistics database Data Warehouse Kafka Cluster topic: advertising-data Optimize advertising functor
  • 25.
  • 26. Improvements after applying Lambda Architecture version 1 ◦ Checking advertising campaign in less than 5 seconds (near real-time) ◦ Scalability for 20 million users ◦ Fault tolerance (thank to Kafka) ◦ Take less than 3 minutes for reporting (Kafka → Redis → Oracle) ◦ Take less than 1 hour for Data Mining Jobs (with HBase and Java code)
  • 27.
  • 28. Lambda Architecture 2.0 at FPT Make everything as simple as possible, but not simpler. In version 2, we want to react to the World faster
  • 29. Why is λ 2.0? Because λ 1.0 does not deliver “ Fast Personalized UX” for Agile Business → Reactive Advertising for Digital Media →
  • 31. Improvements in Lambda Architecture version 2 ◦ Fast Processing for Data Mining / Learning ◦ Fast Agile for Data Science ◦ Fast Reactive for critical event in real-time ◦ Fast Connecting from demand to supply ◦ Fast Sharing information among the “Actor”
  • 32.
  • 33. “Lambda Architecture 2.0” will be “In Production” after June 1, 2015
  • 34. Problem: tracking, crawling and measuring social trends and user behaviors from social media
  • 35. “ 4) Fast Data Architecture at FPT
  • 36. Built with Lambda Architecture 2.0 reacting to critical event faster for data mining and machine learning on Big Data for observable data SQL querying (SQL is true lambda language !?)
  • 37. We use new CONCEPTS for abstracting all things in system Data Actor Is the source for all data Data Pipeline just special data queue Data Event is what’s happened in specific context, emitted by data actor Processing Topology is a directed graph of event processors Event processor is the logical processor for data pipeline Reactive functor is the is a functional actor that responds (reacts) to external data events
  • 38. The architecture for Fast Data: Reactive real-time data pipeline
  • 39. 1) Data collector (I/O networking) ● Netty for event collector and HTTP server (rfx-track) 2) Data persistence (aka: data storage) ● Kafka for distributed message queue(Apache Kafka) ● HBase for scalable big table 3) Data processing ● RFX with stream processor (RFX framework) ● Apache Spark for in-memory batch processing ● RxJava + Akka for reactive processor (reactivex) 4) Data analysis ● measures of uncertainty ○ Python Dempster-Shafer theory → anti-fraud 5) Data reporting ● SQL with Phoenix (http://phoenix.apache.org ) ● real-time frontend report: NodeJs, SocketIO Technology stack ( 5D model )
  • 40. “ 5) What we have learned
  • 41. 1) Understanding problems 2) Pick right framework 3) Build 4) Test 5) Measure 6) Learn 7) Go back to step 3
  • 42. It’s time to solve practical problem “Lambda for Vacuum Cleaner” 1) push the “Hello” message into vacuum cleaner 2) take it out ! 3) display message “HELLO” in readable and ordered format? Question 1: Which data structure could be used to implement these tasks ? Question 2: Can you solve it using a functional programming language? Question 3: Write code to prove your solution
  • 43. Solution code for problem “Lambda for Vacuum Cleaner”
  • 44.
  • 45. “In a concurrent world, imperative is the wrong default !” – Tim Sweeney, Epic Games
  • 46. Hey, back-end engineers and front-end engineers Want to join team FA2 at FPT? Please submit your CV to tantrieuf31@gmail.com