How Data Engineers and Data Analysts approach Operating, Optimizing & Governing Big Data Platforms at Flipkart, India's e-commerce market leader. By Arya Ketan and Rishabh Dua
Flipkart Data Platform @ Scale - slash n 2018 reprise
1. Flipkart Data Platform @ Scale
Arya Ketan, Rishabh Dua
Engineers @ Flipkart Tech
In God we trust. All others must bring data!
2. Flipkart confidential - For Internal use only. Not to be shared externally.
Agenda
1. Data @ Flipkart
2. Data platform architecture
3. Challenges @ Scale
4. Operating
5. Storage & Compute Optimizations
6. Data Governance
Who are the users?
“Torture the data, and it will confess to anything.”
Big Data - no longer just a buzzword
● 80% of data is < 2 years old
● 15+ PB of HDFS files
● 3 billion+ events ingested daily
● 400 billion+ container hours daily
● 30+ TB ingested daily
Challenges in batch processing
Classic Batch pattern
● Fixed window cycles
● Repeated every window
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Challenges in batch processing
● Breaks down when used with sophisticated window strategies, e.g. sessions
● Businesses crave more timely data
● Uneven workload spread
Stream processing patterns
● Stream
○ Low latency but approximate results
○ Unordered data of varying event-time skew
● Event time: the time at which events actually occurred.
● Processing time: the time at which events are observed in the system.
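The gap between the two clocks can be made concrete with a small sketch. The event names and timestamps below are made up for illustration; the point is only that skew varies per event and that arrival order need not match occurrence order.

```python
from datetime import datetime, timedelta

# Hypothetical events: (name, event_time, processing_time).
events = [
    ("click", datetime(2018, 1, 1, 12, 0, 5), datetime(2018, 1, 1, 12, 0, 6)),
    ("order", datetime(2018, 1, 1, 12, 0, 10), datetime(2018, 1, 1, 12, 4, 10)),
    ("view", datetime(2018, 1, 1, 11, 59, 0), datetime(2018, 1, 1, 12, 5, 0)),
]

def skew(event_time, processing_time):
    """Event-time skew: how long after it occurred the event was observed."""
    return processing_time - event_time

skews = {name: skew(et, pt) for name, et, pt in events}

# Order in which the system sees the events vs. order in which they happened.
arrival_order = [name for name, _, pt in sorted(events, key=lambda e: e[2])]
occurrence_order = [name for name, et, _ in sorted(events, key=lambda e: e[1])]
```

Here the "view" event happened first but is observed last, which is the unordered, varying-skew behaviour a stream processor has to cope with.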
Lambda Architecture
Semantics for unbounded data
● Time-agnostic
● Approximation
Semantics for unbounded data
● Windowing by processing time
● Windowing by event time
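A minimal sketch of the difference, assuming fixed (tumbling) 60-second windows and made-up records: the same records land in different windows depending on which timestamp drives the windowing.

```python
from collections import Counter

WINDOW_SECONDS = 60  # fixed (tumbling) window size; illustrative value

# Hypothetical records: (event_time, processing_time), both in epoch seconds.
records = [(5, 10), (50, 70), (65, 75), (20, 130)]

def window_start(ts, size=WINDOW_SECONDS):
    """Start of the fixed window that contains timestamp ts."""
    return ts - ts % size

# Windowing by processing time: group by when the system saw the record.
by_processing_time = Counter(window_start(pt) for _, pt in records)

# Windowing by event time: group by when the event actually happened.
by_event_time = Counter(window_start(et) for et, _ in records)
```

Processing-time windows are simple to compute but misplace late records; event-time windows give the correct grouping at the cost of waiting for, or later correcting for, late data.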
Batch to fStream
● Streaming applications
○ f-SQL (ANSI SQL compliant)
Batch to fStream
● Streaming applications
○ Materialized time windows: time-partitioned aggregates in HBase
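One way to picture materialized time windows is a key-value table whose row key embeds the time bucket. The sketch below uses an in-memory dict as a stand-in for the store; the row-key layout, the hourly bucket, and the function names are assumptions for illustration, not the fStream design.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for a key-value table: row key -> running aggregate.
store = defaultdict(int)

def row_key(metric, epoch_seconds):
    """Hypothetical row key: metric name plus the hourly bucket of the event time."""
    bucket = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y%m%d%H")
    return f"{metric}#{bucket}"

def record(metric, epoch_seconds, value=1):
    """Fold one event into the aggregate of its time partition."""
    store[row_key(metric, epoch_seconds)] += value

def total(metric, start_bucket, end_bucket):
    """Sum the aggregates of all buckets in [start_bucket, end_bucket]."""
    lo, hi = f"{metric}#{start_bucket}", f"{metric}#{end_bucket}"
    return sum(v for k, v in store.items() if lo <= k <= hi)

record("orders", 0)        # falls in bucket 1970010100
record("orders", 100)      # same bucket
record("orders", 3700, 5)  # falls in bucket 1970010101
```

Because the bucket is part of the key, a query over a time range only needs the handful of pre-aggregated rows in that range rather than the raw events.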
Improvements
● Fresher data at lower latency
○ User Insight prediction
○ Trust and Safety Interventions
● Newer features for data-science
○ User sessionization
● Lower resource consumption
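User sessionization, listed above as a newer data-science feature, is commonly done with gap-based session windows. The following is a sketch under that assumption; the 30-minute inactivity gap is an illustrative value, not one taken from the deck.

```python
SESSION_GAP = 30 * 60  # inactivity gap in seconds; 30 minutes is an assumed value

def sessionize(timestamps, gap=SESSION_GAP):
    """Group one user's event timestamps into sessions split by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # close enough to the previous event
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

example = sessionize([4100, 0, 600, 10000, 4000])
```

Unlike fixed windows, session boundaries depend on the data itself, which is why this pattern is awkward in fixed-cycle batch jobs.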
Overload @ constant capacity
● More users, more use cases, more jobs, more resources
○ 100x Increase in compute hours
● Hardware not available in the data center to scale at the same rate
○ 1.1x increase in machine instances
Job analysis
● Problems in jobs are not obvious
● Many possible configurations: Hive, Hadoop, HDFS, Spark, JVM
● Inter-related settings
● Information & metrics are scattered
Optimizing compute usage
● Automated performance monitoring and tuning tool
● Indicates best practices and tuning tips
● Best performance for every job
Dr. Elephant to the rescue
http://github.com/linkedin/dr-elephant/
Dr. Elephant Dashboard
Dr. Elephant - Heuristics & Severity
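A heuristic in this style takes job metrics and maps them to a severity level. The sketch below is not Dr. Elephant's code: the severity names follow the five levels the tool reports, but the metric (max/median task runtime) and every threshold are hypothetical.

```python
from enum import IntEnum
from statistics import median

class Severity(IntEnum):
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4

# Hypothetical thresholds on the max/median task-runtime ratio.
SKEW_THRESHOLDS = [
    (2.0, Severity.LOW),
    (4.0, Severity.MODERATE),
    (8.0, Severity.SEVERE),
    (16.0, Severity.CRITICAL),
]

def runtime_skew_severity(task_runtimes):
    """Rate how unevenly work is spread across a job's tasks (runtimes must be > 0)."""
    ratio = max(task_runtimes) / median(task_runtimes)
    severity = Severity.NONE
    for threshold, level in SKEW_THRESHOLDS:
        if ratio >= threshold:
            severity = level  # keep the highest threshold that is crossed
    return severity
```

A job's overall rating can then be taken as the worst severity across all heuristics, which gives users one signal to act on.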
Optimizing compute - Tez vs MapReduce
● Tez creates a DAG of tasks.
Compared to MR
○ No intermediate data written to HDFS between stages
○ Larger memory footprint
No one size fits all
● Assigner chooses compute engine
○ Container hours
○ Resources used
○ Configuration tweaking
[Diagram: a job to be scheduled goes to the Assigner, which chooses the compute engine: Tez or MR]
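The deck does not spell out the Assigner's rules, so the following is only a sketch of the idea. The `JobProfile` fields, the memory budget, and the decision rule are all assumptions, chosen to reflect the trade-off above: Tez helps multi-stage jobs but needs more memory.

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    name: str
    stages: int            # number of chained map/reduce stages
    peak_memory_gb: float  # estimated peak memory per container

MEMORY_LIMIT_GB = 8.0  # hypothetical per-container memory budget

def assign_engine(job):
    """Pick Tez when the job has stages to chain and fits the memory budget."""
    if job.stages > 1 and job.peak_memory_gb <= MEMORY_LIMIT_GB:
        return "TEZ"
    return "MR"

engines = {
    job.name: assign_engine(job)
    for job in [
        JobProfile("daily_rollup", stages=4, peak_memory_gb=3.0),
        JobProfile("huge_join", stages=3, peak_memory_gb=20.0),
        JobProfile("simple_copy", stages=1, peak_memory_gb=1.0),
    ]
}
```

A real assigner would presumably draw on observed container hours and resource usage from earlier runs instead of static estimates.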
Optimizing storage
Many storage formats: JSON, Avro, ORC
Which storage format?
ORC vs Avro vs Parquet vs JSON
● ORC and Parquet score over Avro and JSON
○ Encoding, dictionaries, indexes, projection pushdown, predicate pushdown
● Choose Parquet for highly nested structures.
○ Note: We are working on a feature in ORC + Hive to support predicate pushdown and projection pushdown.
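To illustrate what the two pushdowns buy, here is a toy columnar layout in plain Python. It is not the ORC or Parquet API; the stripe structure, column names, and values are invented, and only a `>=` predicate is supported.

```python
# A toy columnar "file": each stripe stores columns separately, with min/max stats.
stripes = [
    {
        "stats": {"price": (10, 90)},
        "columns": {"order_id": [1, 2, 3], "price": [10, 50, 90], "city": ["BLR", "DEL", "BOM"]},
    },
    {
        "stats": {"price": (100, 500)},
        "columns": {"order_id": [4, 5], "price": [100, 500], "city": ["BLR", "HYD"]},
    },
]

def scan(stripes, select, column, min_value):
    """Return the selected columns for rows where column >= min_value."""
    rows, stripes_read = [], 0
    for stripe in stripes:
        _, col_max = stripe["stats"][column]
        if col_max < min_value:
            continue  # predicate pushdown: skip the stripe using stats alone
        stripes_read += 1
        cols = stripe["columns"]
        for i, value in enumerate(cols[column]):
            if value >= min_value:
                # projection pushdown: only the requested columns are touched
                rows.append(tuple(cols[name][i] for name in select))
    return rows, stripes_read

result = scan(stripes, select=["order_id"], column="price", min_value=100)
```

Row-oriented formats such as JSON or Avro have to decode every field of every record to answer the same query.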
Optimized storage format
● Columnar format
● Integrated compression, indexes and stats
● Predicate pushdown & projection pushdown
● Run-length encoding
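Run-length encoding works well in a columnar layout because values of one column sit next to each other and often repeat. A minimal sketch of the general technique (not ORC's actual encoder):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(column)
```

The savings grow when the column is sorted or has low cardinality, since runs become longer.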
Improvements
● ORC
○ ~80% savings in storage, ~60% savings in compute
● Dr. Elephant
○ 2000+ jobs improved
○ ~70% savings in compute
● Tez
○ 10-100x improvement in processing speed
Unreliability due to data issues
● What is the source of truth for “Order Item Information”?
-- No way to annotate a data asset as blessed
● Why is this “Id” not in the Data Platform?
-- Referential integrity constraints & validations are not supported
● Why does Account-Id have invalid characters like “%@#21323213”?
-- The column is an “account id”, not just a String.
● Why does my data table have yesterday’s data?
-- RCA of the dependencies is hard
Missing guard-rails & attribution
● Unrestricted usage of data assets in the platform
● No minimum guarantees on compute for job execution
Lineage
● Data Assets Lineage
○ Easier RCA
○ Enables Reuse
○ Strategies to improve data quality
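Lineage makes RCA easier because the upstream dependencies of a stale or broken asset can be walked mechanically. A small sketch, assuming lineage is available as a mapping from each asset to the assets it is derived from; the asset names are invented.

```python
# Hypothetical lineage: each data asset maps to the assets it is derived from.
lineage = {
    "daily_sales_report": ["order_item_fact"],
    "order_item_fact": ["orders_raw", "items_raw"],
    "orders_raw": [],
    "items_raw": [],
}

def upstream(asset, graph):
    """Return every direct and transitive dependency of an asset."""
    seen, stack = set(), list(graph.get(asset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

suspects = upstream("daily_sales_report", lineage)
```

When a report shows yesterday's data, the set returned here is the list of places to check for a delayed or failed run.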
What is Data Governance?
1. Catalog of data assets: schema & dependency definition
2. Classify and govern these assets: attributes, tagging & security policies
3. Collaboration capabilities around these data assets: ownership, accountability, subscriptions
Schema Tightening
● Why?
Identify data issues before they enter the system
[Diagram: MicroService1 and MicroService2 send records to Data Platform ingestion. AccountId: ABC21312321333 passes through to the Data Platform; AccountId: FOO%%1231233 is rejected with ERROR.]
Schema Tightening
How?
● Business types, e.g. AccountId, Price, OrderId
● Validations via JSON Schema
● Migrating to schema-tightened entities
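As a sketch of how a business type can reject bad values at ingestion, the block below describes two types with JSON Schema-style keywords and checks values with a tiny hand-rolled validator. The patterns and limits are assumptions, and the validator covers only the keywords shown; a real setup would use a full JSON Schema implementation.

```python
import re

# Hypothetical business types expressed as JSON Schema-style fragments.
BUSINESS_TYPES = {
    "AccountId": {"type": "string", "pattern": "^[A-Z0-9]+$"},
    "Price": {"type": "number", "minimum": 0},
}

def validate(value, schema):
    """Check a value against the small subset of keywords used above."""
    if schema["type"] == "string":
        if not isinstance(value, str):
            return False
        # JSON Schema "pattern" is an unanchored search; anchors live in the pattern.
        return re.search(schema.get("pattern", ""), value) is not None
    if schema["type"] == "number":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return value >= schema.get("minimum", float("-inf"))
    return False  # unsupported type in this sketch

accepted = validate("ABC21312321333", BUSINESS_TYPES["AccountId"])
rejected = validate("FOO%%1231233", BUSINESS_TYPES["AccountId"])
```

Declaring a column as `AccountId` rather than a plain string is what lets the platform turn "invalid characters" into an ingestion-time error.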
Data Quality Asserts
● Multiple constraints supported, e.g. NULL Check, Variance, Referential Change, Custom Query
● Auto-triggered when a fact is finished
● Anyone can subscribe to an assert rule
● Jira & email integration
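Two of the constraint types can be sketched in a few lines. The function names, the row format, and the reading of "variance" as deviation of today's row count from the historical mean are assumptions for illustration, not the platform's definitions.

```python
def null_check(rows, column):
    """Pass only if no row has a missing value in the given column."""
    return all(row.get(column) is not None for row in rows)

def variance_check(today_count, history, max_ratio=0.5):
    """Pass if today's row count is within max_ratio of the historical mean."""
    mean = sum(history) / len(history)
    return abs(today_count - mean) <= max_ratio * mean

def run_asserts(asserts):
    """Run named checks and return the names of those that failed."""
    return [name for name, check in asserts.items() if not check()]

rows = [{"account_id": "A1"}, {"account_id": None}]
failed = run_asserts({
    "account_id_not_null": lambda: null_check(rows, "account_id"),
    "row_count_stable": lambda: variance_check(100, [90, 110, 100]),
})
```

The list of failed rule names is the natural hook for notifying subscribers, for example by opening a ticket or sending an email.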
Org Queues
Why Org Queues?
Introduce fairness in allocation of Data Platform’s compute resources.
Optimize usage of already overloaded cluster, ensuring rogue jobs are preempted.
Features of Org Queue
● Guaranteed Minimum Compute
● Burstability & Pre-emption
● Sub queues of different sizes to improve reliability of P0 jobs
● Org Admins to manage the Users & Jobs in the queue
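The interplay of guaranteed minimum, burstability, and pre-emption can be sketched as a two-step allocation. This is a simplified model, not the cluster scheduler's algorithm: queue names and numbers are invented, and it assumes the guarantees sum to at most the cluster capacity.

```python
def allocate(capacity, queues):
    """queues: name -> (guaranteed, demand). Returns name -> allocated containers."""
    # Step 1: every queue gets its demand, up to its guaranteed minimum.
    alloc = {name: min(g, d) for name, (g, d) in queues.items()}
    spare = capacity - sum(alloc.values())
    # Step 2 (burst): hand out spare capacity one container at a time,
    # round-robin, to queues that still have unmet demand.
    while spare > 0:
        hungry = [n for n, (_, d) in queues.items() if alloc[n] < d]
        if not hungry:
            break
        for name in hungry:
            if spare == 0:
                break
            alloc[name] += 1
            spare -= 1
    return alloc

# "ads" bursts into capacity that "search" is not using ...
quiet = allocate(100, {"search": (40, 20), "ads": (30, 70), "supply": (30, 30)})
# ... and is pre-empted back to its guarantee once "search" needs its share.
busy = allocate(100, {"search": (40, 40), "ads": (30, 70), "supply": (30, 30)})
```

Recomputing the allocation as demand changes is what models pre-emption here: burst containers are the first to be reclaimed.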
Summary
Challenges @ Scale
● Overloaded cluster @ constant capacity
● Batch processing patterns
● Data quality issues
● Missing guard-rails
Features & Optimizations
● fStream
● Dr. Elephant - Job Analysis
● Tez - Compute Engine
● ORC - Storage Format
Data Governance
● Dependency Lineage
● Schema Tightening
● DQ Asserts
● Org Queues
40. Q & A
“Without big data, you are blind and deaf and in the middle of Outer Ring Road.”