Lambda Architecture 
Analyzing large scale, unstructured, dynamic data
Rajesh Muppalla (@codingnirvana) 
rajesh@indix.com
Indix - Quick Overview 
Am I priced higher or lower w.r.t. my competitor on Nikon D700?
Which product has the UPC - 8745354434?
What are all the variants of Apple MacBook Air 13”?
What is the average price change of all Nike Shoes in Walmart in the last 3 months?
Data Pipeline @ Indix 
Crawling → Parsing → Classification → Matching → Product & Price Catalog
[Diagram labels: C, ML Model (two instances), repeated C1 and C2 items]
Data Pipeline @ Indix 
Analytics (Precomputes, Insights)
Search Index 
Product & Price 
Catalog 
Experiences 
We released v1.0 of our API today - developer.indix.com
Data is Dynamic 
Crawling → Parsing → Classification → Matching
[Diagram labels: C, repeated C1 and C2 items, ML Model, ML Model (new)]
Data Scale 
● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites
Data Pipeline v1.0
Batch using HBase & MapReduce
Problem 1 
Mutable State 
Data Systems should be Human Fault Tolerant
Problem 2 
Compactions 
Random Write databases are hard to manage at large scale
Problem 3 
16 hours 
A latency of 16 hours is a lot. We wanted it to be a couple of hours.
Three Problems 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Rethink our data systems
Lambda Architecture
Lambda Architecture 
● An approach to build big data systems 
○ Architectural Components & Principles 
○ Ties Batch & Real Time Systems 
○ General Purpose - Domain Agnostic 
● Coined by Nathan Marz 
○ Ex-Twitter Engineer 
○ Creator of Storm
Data System - Traditional Approach 
HBase 
Application 
Source of Truth
Data System - New Approach 
Immutable Raw Data
Application
Processed View(s)
Source of Truth
Let’s take an example 
Find the count of unique products in any given category for the entire time range
Two Requirements 
● Recomputations 
● Large Scale
Batch Layer Implementation 
[Diagram components]
Products Master Data: HDFS (Vertical Partitioning), with partitions 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm
New Data
MR Job 1
Intermediate view: C1, C2, C3, C4, C5
MR Job 2
Batch View: HBase, with C1 5, C2 7, C3 4, C4 7, C5 1
Query
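The batch flow on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual Indix jobs: the record layout `(category, product_id)`, the function names `build_intermediate_view` and `build_batch_view`, and the in-memory dictionaries standing in for HDFS partitions and the HBase view are all hypothetical, and the split of work between the two stages is a guess at what the diagram shows.

```python
from collections import defaultdict

# Hypothetical master data: immutable, append-only records grouped by hour
# (a stand-in for time-partitioned files on HDFS).
master_data = {
    "9 am": [("C1", "p1"), ("C1", "p2"), ("C2", "p3")],
    "10 am": [("C1", "p2"), ("C2", "p4"), ("C3", "p5")],
}

def build_intermediate_view(partitions):
    """Stage 1 (in the spirit of MR Job 1): set of product ids per category."""
    view = defaultdict(set)
    for records in partitions.values():
        for category, product_id in records:
            view[category].add(product_id)
    return view

def build_batch_view(intermediate):
    """Stage 2 (in the spirit of MR Job 2): unique product count per category."""
    return {category: len(products) for category, products in intermediate.items()}

batch_view = build_batch_view(build_intermediate_view(master_data))
print(batch_view)
```

Because the master data is never modified, running both stages again over all partitions always rebuilds the same view, which is what makes recomputation cheap to reason about.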
Handling Recomputations 
[Diagram repeated from the Batch Layer Implementation slide: Products Master Data on HDFS, New Data, MR Job 1, Intermediate view, MR Job 2, Batch View in HBase, Query]
Handling Scale 
● Hadoop HDFS, MapReduce, HBase 
● Proven Linear Scalability
Three Problems (Recap) 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Human Fault Tolerance 
● Bugs in the batch jobs 
○ Discard views & Recompute 
● Bugs in the master data jobs 
○ Re-process the master data to hide the old data 
● Bugs in the query 
○ Re-deploy the query layer 
● Traceability as a side effect
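The "discard views & recompute" recovery path can be illustrated with a small Python sketch. Everything here is hypothetical (the event layout and the names `buggy_count` and `fixed_count`); it only assumes, as the slides do, that the raw data is immutable and that views are derived from it.

```python
# Immutable raw events: never edited in place, only appended to.
raw_events = [
    {"category": "C1", "product": "p1"},
    {"category": "C1", "product": "p1"},  # the same product seen twice
    {"category": "C1", "product": "p2"},
]

def buggy_count(events):
    """Buggy job: counts every event, so repeated products are double counted."""
    counts = {}
    for e in events:
        counts[e["category"]] = counts.get(e["category"], 0) + 1
    return counts

def fixed_count(events):
    """Fixed job: counts distinct products per category."""
    seen = {}
    for e in events:
        seen.setdefault(e["category"], set()).add(e["product"])
    return {category: len(products) for category, products in seen.items()}

view = buggy_count(raw_events)  # a wrong view gets published
view = fixed_count(raw_events)  # discard it and recompute from the raw data
print(view)
```

The bug never touched `raw_events`, so fixing the job and rerunning it is enough to repair the view.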
Operational Complexity 
● No random writes in the batch layer 
○ Bulk Updates to build the batch view
Great… What about Latency?
Speed Layer 
Queue (Kafka)
Recent Data
Real Time Processing (Storm)
HyperLogLog Sets
Query
Random Writes (Updates)
Read-Write Data Store (Riak, HBase, Cassandra)
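To make the HyperLogLog sets on this slide concrete, here is a small pure-Python version of the standard HyperLogLog estimator. It is a sketch for illustration, not the Storm code used at Indix: the class name `HyperLogLog`, the precision parameter `p`, and the use of SHA-1 as the hash are all assumptions.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1 # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros) # small-range correction
        return raw

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"product-{i}")
print(round(hll.count()))
```

The sketch needs only a fixed number of small registers per category, which suits a read-write store that receives frequent updates, and `merge` lets sketches built on different workers be combined without seeing the raw items again.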
Speed Layer has mutation... But 
● Speed layer deals with much smaller data 
○ Batch Layer - Months/years of data 
○ Speed Layer - Few hours or 1 day of data 
● Easy to manage operationally 
Complexity Isolation
Final Step - Merging Results 
Data
Batch Layer: C1 - 50000
Speed Layer: C1 - 499 (Approximate with error 0.02%)
Query
Merged Results: C1 - 50499
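The merge step on this slide amounts to adding the two per-category counts at query time. A minimal Python sketch, assuming both layers expose a plain category-to-count mapping (the function name `merge_views` is hypothetical) and that the speed layer only counts products the batch view has not yet seen:

```python
def merge_views(batch_view, speed_view):
    """Combine per-category counts from the batch and speed layers at query time."""
    categories = set(batch_view) | set(speed_view)
    return {c: batch_view.get(c, 0) + speed_view.get(c, 0) for c in categories}

batch_view = {"C1": 50000}  # exact, from the last batch run
speed_view = {"C1": 499}    # approximate, for data since that run
print(merge_views(batch_view, speed_view))  # {'C1': 50499}
```

If the same product could appear in both layers, plain addition would overcount; merging the underlying sets or sketches instead of the counts would avoid that.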
What about Accuracy? 
Batch Layer 
Speed Layer 
Data 
Query 
Merged Results 
C1 - 499 (Approximate with error 0.02%)
C1’ - 50500
Batch Layer
C1 - 50000
C1’ - 50500
Eventually Accurate
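The "eventually accurate" behaviour can be traced with the numbers from the slides. This Python sketch is an assumption about the mechanics rather than a description of the Indix system: it supposes that each batch run recomputes the view from all master data and that the speed-layer entries it now covers are dropped afterwards. The function name `query` is hypothetical.

```python
def query(batch_view, speed_view, category):
    """Answer a query by adding the batch count and the speed-layer count."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)

# Before the next batch run: exact batch count plus approximate recent count.
batch_view = {"C1": 50000}
speed_view = {"C1": 499}
before = query(batch_view, speed_view, "C1")

# After the next batch run: the batch view is recomputed from all master data,
# and the speed-layer entries it now covers are dropped (assumed behaviour).
batch_view = {"C1": 50500}
speed_view = {}
after = query(batch_view, speed_view, "C1")

print(before, after)  # 50499 50500
```

The approximate answer is only ever served for the window since the last batch run; once the batch layer catches up, the exact value replaces it.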
Lambda Architecture
Lambda Architecture @ INDIX
Lambda Architecture @ Indix
Batch Layer @ Indix 
● Pail 
○ Vertical partitioning 
○ Consolidation of small files 
● Scalding 
● Thrift for enforcing schemas 
● HBase/Solr for views 
○ Bulk updates to create views
Speed Layer @ Indix 
● Still WIP 
● To reduce latency 
○ Micro batches for Speed layer 
○ Use the last batch run + bulk update views
Open Challenges 
● Managing both Batch & Real Time still painful 
● Two broad directions 
○ Abstractions 
■ SummingBird (Twitter) 
○ Unified Stack 
■ Spark 
■ Kafka + Samza/Storm (LinkedIn) 
■ Cloud Data Flow (Google)
In Conclusion... 
● Lambda Architecture 
○ A different approach to build data systems 
○ Solid principles 
○ Domain Agnostic 
○ Tools not yet mature
Resources 
● Indix Engineering Blog - http://engineering.indix.com 
● Runaway Complexity in Big Data Systems 
● Lambda Architecture 
● Big Data Book - Manning 
● Scalding 
● Spark 
● Pail 
● Summingbird
Key Takeaways 
- Human Fault Tolerance 
- Complexity Isolation 
- Higher Level Abstractions
Thank You
Batch vs Real Time Choices
Tying it all together - Go-CD
Extras 
● Monoids 
● LA is not new 
○ Search Engines (fast, slow crawl) 
○ Event Sourcing (immutable events to maintain state)
○ Patch, Audit, Bootstrap
Problem Statement - Optimization
