Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014.
It covers our experience using Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
Lambda architecture @ Indix
1. Lambda Architecture
Analyzing large scale, unstructured, dynamic data
Rajesh Muppalla (@codingnirvana)
rajesh@indix.com
2. Indix - Quick Overview
Am I priced higher or lower w.r.t. my competitor on Nikon D700?
Which product has the UPC - 8745354434?
What are all the variants of Apple MacBook Air 13”?
What is the average price change of all Nike Shoes in Walmart in the last 3 months?
3. Data Pipeline @ Indix
[Diagram: Crawling → Parsing → Classification → Matching → Product & Price Catalog, with ML models driving the pipeline and pages grouped into clusters C1 and C2]
4. Data Pipeline @ Indix
[Diagram: Product & Price Catalog feeding Analytics (Precomputes, Insights) and the Search Index, which power Experiences]
We released the v1.0 of our API today - developer.indix.com
5. Data is Dynamic
[Diagram: Crawling → Parsing → Classification → Matching, showing an existing ML Model alongside a new ML Model and the resulting C1 and C2 groupings]
6. Data Scale
● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites
11. Problem 3
16 hours
16 hours of latency is a lot. We wanted it to be a couple of hours.
12. Three Problems
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff
15. Lambda Architecture
● An approach to build big data systems
○ Architectural Components & Principles
○ Ties Batch & Real Time Systems
○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
○ Ex-Twitter Engineer
○ Creator of Storm
16. Data System - Traditional Approach
HBase
Application
Source of Truth
17. Data System - New Approach
Immutable
Raw
Data
Application
Processed
View(s)
Source of Truth
18. Let’s take an example
Find the count of unique products in any
given category for the entire time range
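A minimal sketch of how a batch view for this example could be derived from immutable raw data. The `Observation` record and its field names are hypothetical, chosen only for illustration; the talk does not show the actual schema or job.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Observation(NamedTuple):
    """Hypothetical raw record: one sighting of a product in a category."""
    product_id: str
    category: str
    timestamp: int


def build_batch_view(master: Iterable[Observation]) -> Dict[str, int]:
    """Recompute the view from scratch over the entire master dataset."""
    products_by_category = defaultdict(set)
    for obs in master:
        # The same product may be observed many times; the set deduplicates.
        products_by_category[obs.category].add(obs.product_id)
    return {cat: len(ids) for cat, ids in products_by_category.items()}


master = [
    Observation("p1", "shoes", 1),
    Observation("p1", "shoes", 2),
    Observation("p2", "shoes", 3),
    Observation("p9", "cameras", 4),
]
batch_view = build_batch_view(master)
```

The view is a pure function of the raw data, so it can be thrown away and rebuilt at any time.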
23. Three Problems (Recap)
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff
24. Human Fault Tolerance
● Bugs in the batch jobs
○ Discard views & Recompute
● Bugs in the master data jobs
○ Re-process the master data to hide the old data
● Bugs in the query
○ Re-deploy the query layer
● Traceability as a side effect
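As a rough illustration of "discard views & recompute": because the master data is never mutated, a bug in a batch job only corrupts a derived view. The dataset and the category-normalization bug below are made up for the example.

```python
# Hypothetical immutable master data: (category, product_id) pairs.
MASTER_DATA = (
    ("Shoes", "p1"),
    ("shoes", "p2"),
    ("shoes", "p1"),
    ("Cameras", "p9"),
)


def build_view(records, normalize):
    """Count unique products per category, after normalizing category names."""
    products = {}
    for category, product_id in records:
        products.setdefault(normalize(category), set()).add(product_id)
    return {category: len(ids) for category, ids in products.items()}


# Buggy job: forgets to normalize case, so "Shoes" and "shoes" are split.
buggy_view = build_view(MASTER_DATA, normalize=lambda c: c)

# Fix the job, discard the old view, and recompute from the same raw data.
fixed_view = build_view(MASTER_DATA, normalize=str.lower)
```

Nothing in `MASTER_DATA` had to be repaired; only the derived view was replaced.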
25. Operational Complexity
● No random writes in the batch layer
○ Bulk Updates to build the batch view
27. Speed Layer
[Diagram: Recent Data → Queue (Kafka) → Real Time Processing (Storm) → Random Writes (Updates) → Read-Write Data Store (Riak, HBase, Cassandra) holding HyperLogLog Sets → Query]
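The HyperLogLog sets kept by the speed layer trade exactness for small, mergeable state. Below is a simplified, self-contained sketch of the idea, not the implementation used in the talk: the class name, the SHA-1 based hash, and the default of 2^12 registers are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog: approximate distinct counts in fixed memory."""

    def __init__(self, p: int = 12):
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # The union of two sets is the element-wise max of their registers.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"product-{i}")
estimate = hll.count()  # close to 10000, but not exact
```

Re-adding an item never changes the registers, which makes the structure safe under the repeated updates a real-time stream produces.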
28. Speed Layer has mutation... But
● Speed layer deals with much smaller data
○ Batch Layer - Months/years of data
○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally
Complexity Isolation
29. Final Step - Merging Results
Data → Batch Layer: C1 - 50000
Data → Speed Layer: C1 - 499 (approximate, with error 0.02%)
Query → Merged Results: C1 - 50499
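The merge step on this slide can be sketched as a per-category sum at query time. The function name and dictionary shapes are hypothetical. Plain addition assumes the products seen by the speed layer are not already counted in the batch view; if the two could overlap, the underlying sets would need to be unioned instead.

```python
from typing import Dict


def merge_views(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Combine the precomputed batch counts with the recent speed-layer counts."""
    merged = dict(batch_view)
    for category, recent_count in speed_view.items():
        merged[category] = merged.get(category, 0) + recent_count
    return merged


batch_view = {"C1": 50000}   # exact, as of the last batch run
speed_view = {"C1": 499}     # approximate, data since the last batch run
merged = merge_views(batch_view, speed_view)
```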
30. What about Accuracy?
Batch Layer
Speed Layer
Data
Query
Merged Results
C1 - 499
(Approximate with
error 0.02%)
C1’ - 50500
Batch Layer
C1 - 50000
C1’ - 50500
Eventually Accurate
34. Batch Layer @ Indix
● Pail
○ Vertical partitioning
○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
○ Bulk updates to create views
35. Speed Layer @ Indix
● Still WIP
● To reduce latency
○ Micro batches for Speed layer
○ Use the last batch run + bulk update views
36. Open Challenges
● Managing both Batch & Real Time still painful
● Two broad directions
○ Abstractions
■ Summingbird (Twitter)
○ Unified Stack
■ Spark
■ Kafka + Samza/Storm (LinkedIn)
■ Cloud Dataflow (Google)
37. In Conclusion...
● Lambda Architecture
○ A different approach to build data systems
○ Solid principles
○ Domain Agnostic
○ Tools not yet mature
38. Resources
● Indix Engineering Blog - http://engineering.indix.com
● Runaway Complexity in Big Data Systems
● Lambda Architecture
● Big Data Book - Manning
● Scalding
● Spark
● Pail
● Summingbird