Flipkart Data Platform @ Scale - slash n 2018 reprise

How Data Engineers and Data Analysts approach Operating, Optimizing & Governing Big Data Platforms at Flipkart, India's e-commerce market leader. By Arya Ketan and Rishabh Dua

  1. Flipkart Data Platform @ Scale. Arya Ketan, Rishabh Dua, Engineers @ Flipkart Tech. “In God we trust. All others must bring data!”
  2. Flipkart confidential - For Internal use only. Not to be shared externally. Agenda
     1. Data @ Flipkart
     2. Data platform architecture
     3. Challenges @ Scale
     4. Operating
     5. Storage & Compute Optimizations
     6. Data Governance
  3. Data @ Flipkart
  4. Who are the users? “Torture the data, and it will confess to anything.”
  5. Big Data - no longer just a buzzword
     ● 80% of data is < 2 years old
     ● 15+ PB of HDFS files
     ● 3 billion+ events ingested daily
     ● 400 billion+ container hours daily
     ● 30+ TB ingested daily
  6. Data Platform Architecture
  7. Architecture
  8. Challenges @ Scale: Operating
     ● Predictability
     ● Reliability
  9. Operating data platform
  10. Challenges in batch processing: classic batch pattern
     ● Fixed window cycles
     ● Repeated every window
     Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  11. Challenges in batch processing
     ● Breaks down when used with sophisticated window strategies, such as sessions
     ● Businesses crave more timely data
     ● Uneven workload spread
     Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  12. Stream processing patterns
     ● Stream
       ○ Low latency but approximate results
       ○ Unordered data of varying event-time skew
     ● Event time: the time at which events actually occurred.
     ● Processing time: the time at which events are observed in the system.
  13. Lambda Architecture
  14. Semantics for unbounded data
     ● Time-agnostic
     ● Approximation
     Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  15. Semantics for unbounded data
     ● Windowing by processing time
     ● Windowing by event time
     Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
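
To make the difference between the two windowing modes concrete, here is a minimal sketch in plain Python. It is not fStream code: the event tuples, the 60-second tumbling window, and the function name `tumbling_window_counts` are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical events: (event_time_s, processing_time_s, value).
# Event time is when the event occurred; processing time is when the
# system observed it.
EVENTS = [
    (5, 12, 1),
    (58, 61, 1),   # occurred in the first minute, observed in the second
    (62, 63, 1),
    (10, 125, 1),  # late arrival: occurred early, observed much later
]


def tumbling_window_counts(events, time_index, window_s=60):
    """Sum values per fixed window, keyed by the window's start time.

    time_index selects which timestamp defines the window:
    0 for event time, 1 for processing time.
    """
    counts = defaultdict(int)
    for event in events:
        window_start = (event[time_index] // window_s) * window_s
        counts[window_start] += event[2]
    return dict(counts)


by_event_time = tumbling_window_counts(EVENTS, time_index=0)
by_processing_time = tumbling_window_counts(EVENTS, time_index=1)
```

The same four events land in different windows depending on which clock is used. Windowing by event time puts the late arrival back in the window where it occurred, while windowing by processing time counts it in the window where it was observed.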
  16. Batch to fStream
     ● Streaming applications
       ○ f-SQL (ANSI-SQL compliant)
  17. Batch to fStream
     ● Streaming applications
       ○ Materialized time windows (HBase, time-partitioned aggregates)
  18. Improvements
     ● Lower latency, fresher data
       ○ User insight prediction
       ○ Trust and safety interventions
     ● Newer features for data science
       ○ User sessionization
     ● Lower resource consumption
  19. Optimizing data platform to improve predictability
  20. Overload @ constant capacity
     ● More users, more use cases, more jobs, more resources
       ○ 100x increase in compute hours
     ● Hardware unavailability in the data center to scale at the same rate
       ○ 1.1x increase in machine instances
  21. Job analysis
     ● Problems in jobs are not obvious
     ● Many possible configurations - Hive, Hadoop, HDFS, Spark, JVM
     ● Inter-related settings
     ● Information & metrics are scattered
  22. Optimizing compute usage: Dr. Elephant to the rescue (http://github.com/linkedin/dr-elephant/)
     ● Automated performance monitoring and tuning tool
     ● Indicates best practices and tuning tips
     ● Best performance for every job
  23. Dr. Elephant Dashboard
  24. Dr. Elephant - Heuristics & Severity
  25. Optimizing compute - Tez vs MapReduce
     ● Tez creates a DAG of tasks. Compared to MR:
       ○ No intermediate data written
       ○ Larger memory footprint
     ● No one size fits all
     ● Assigner chooses compute engine based on:
       ○ Container hours
       ○ Resources used
       ○ Configuration tweaking
     Diagram: job to be scheduled -> Job Assigner -> compute engine chosen (Tez or MR)
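
The slide does not describe the assigner's actual rules, so the following is only a sketch of one plausible policy: pick the engine with the fewest container hours among those whose memory footprint fits a budget. `EngineProfile`, `choose_engine`, the memory budget, and the fallback to MR are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineProfile:
    """Hypothetical per-engine measurements from earlier runs of one job."""
    engine: str
    container_hours: float
    peak_memory_gb: float


def choose_engine(profiles, memory_budget_gb):
    """Return the cheapest engine that fits the memory budget.

    Falls back to "MR" when nothing fits, on the assumption (ours, not
    the slide's) that MapReduce has the smaller memory footprint.
    """
    eligible = [p for p in profiles if p.peak_memory_gb <= memory_budget_gb]
    if not eligible:
        return "MR"
    return min(eligible, key=lambda p: p.container_hours).engine


profiles = [
    EngineProfile("TEZ", container_hours=10.0, peak_memory_gb=64.0),
    EngineProfile("MR", container_hours=40.0, peak_memory_gb=16.0),
]
```

With a 128 GB budget this policy selects Tez for the job above; with a 32 GB budget it selects MR, which reflects the "no one size fits all" point.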
  26. Optimizing storage: many storage formats (JSON, Avro, ORC)
  27. Which storage format? ORC vs Avro vs Parquet vs JSON
     ● ORC / Parquet score over Avro / JSON
       ○ Encoding, dictionaries, indexes, projection pushdown, predicate pushdown
     ● Choose Parquet for highly nested structures.
       ○ Note: We are working on a feature in ORC + Hive to support predicate pushdown and projection pushdown.
  28. Optimized storage format
     ● Columnar format
     ● Integrated compression, indexes and stats
     ● Predicate pushdown & projection pushdown
     ● Run-length encoding
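
Run-length encoding is the easiest of these features to show in a few lines. The sketch below is a generic illustration of the idea, not ORC's actual encoder (which uses more elaborate variants); the function names and the sample column are assumptions.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# A hypothetical column that is sorted or low-cardinality compresses well.
country_column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(country_column)
```

Columnar layouts help here because values of one column sit next to each other, so long runs of repeated values are far more common than in row-oriented formats.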
  29. Improvements
     ● ORC
       ○ ~80% savings in storage, ~60% savings in compute
     ● Dr. Elephant
       ○ 2000+ jobs improved
       ○ ~70% savings in compute
     ● Tez
       ○ 10-100x improvement in processing speed
  30. Data Governance. “With great power comes great responsibility.” - Uncle Ben
  31. Unreliability due to data issues
     ● What is the source of truth for “Order Item Information”? -- No way to annotate a data asset as blessed
     ● Why is this “Id” not in the Data Platform? -- Referential integrity constraints & validations are not supported
     ● Why does Account-Id have invalid characters “%@#21323213”? -- The column is an “account id”, not just a String
     ● Why does my data table have yesterday’s data? -- RCA of the dependencies is hard
  32. Missing guard-rails & attribution
     ● Unrestricted usage of data assets in the platform
     ● No minimum guarantees on compute for job execution
  33. Lineage
     ● Data Assets Lineage
       ○ Easier RCA
       ○ Enables reuse
       ○ Strategies to improve data quality
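
One way lineage makes RCA easier is that a stale table's upstream dependencies can be listed mechanically. The sketch below assumes lineage is stored as a mapping from each asset to its direct upstream assets; the asset names and the `upstream_assets` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage: asset -> direct upstream dependencies.
LINEAGE = {
    "orders_dashboard": ["order_item_fact"],
    "order_item_fact": ["orders_raw", "catalog_raw"],
    "orders_raw": [],
    "catalog_raw": [],
}


def upstream_assets(asset, lineage):
    """Return every transitive upstream asset, nearest dependencies first."""
    seen = set()
    ordered = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


suspects = upstream_assets("orders_dashboard", LINEAGE)
```

When a dashboard shows yesterday's data, the returned list gives the assets to check in order of distance from the dashboard.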
  34. What is Data Governance?
     1. Catalog of data assets: schema & dependency definition
     2. Classify and govern these assets: attributes, tagging & security policies
     3. Collaboration capabilities around these data assets: ownership, accountability, subscriptions
  35. Schema Tightening
     ● Why? Identify data issues before they enter the system
     Diagram: MicroService1 and MicroService2 send records to Data Platform ingestion. AccountId: ABC21312321333 is accepted into the Data Platform; AccountId: FOO%%1231233 is rejected with an ERROR.
  36. Schema Tightening - How?
     ● Business types, e.g. AccountId, Price, OrderId
     ● Validations via JSON Schema
     ● Migrating to schema-tightened entities
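
A business type such as AccountId can be expressed as a JSON Schema fragment with a `pattern` constraint. The sketch below checks that fragment with the standard library only, so it covers just the `type` and `pattern` keywords; a real deployment would presumably use a full JSON Schema validator. The pattern itself (three uppercase letters followed by digits) is an assumption inferred from the sample ids on the slides, not Flipkart's actual rule.

```python
import re

# Hypothetical business type, written as a JSON Schema fragment.
ACCOUNT_ID_SCHEMA = {"type": "string", "pattern": r"^[A-Z]{3}[0-9]+$"}


def validate(value, schema):
    """Return a list of error messages; an empty list means the value is valid.

    Only the "type": "string" and "pattern" keywords are handled here.
    """
    if not isinstance(value, str):
        return ["expected a string"]
    pattern = schema.get("pattern")
    # JSON Schema patterns are unanchored, so search rather than fullmatch.
    if pattern is not None and re.search(pattern, value) is None:
        return [f"{value!r} does not match pattern {pattern!r}"]
    return []


accepted = validate("ABC21312321333", ACCOUNT_ID_SCHEMA)
rejected = validate("FOO%%1231233", ACCOUNT_ID_SCHEMA)
```

Running the check at ingestion means the malformed id is reported to the producing service instead of being stored as an opaque string.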
  37. Data Quality Asserts
     ● Support for multiple constraints, e.g. NULL check, variance, referential change, custom query
     ● Auto-triggered when a fact is finished
     ● Anyone can subscribe to an assert rule
     ● Jira & email integration
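
Two of the constraint types named on this slide can be sketched as plain functions over rows. This is an illustration under assumptions, not the platform's assert engine: rows are modelled as dictionaries, and `null_check`, `referential_check`, and the sample data are hypothetical.

```python
def null_check(rows, column, max_null_fraction=0.0):
    """Pass when the fraction of NULLs in the column is within the limit."""
    if not rows:
        return True
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) <= max_null_fraction


def referential_check(rows, column, valid_keys):
    """Return the values in the column that have no match in valid_keys."""
    return [row.get(column) for row in rows if row.get(column) not in valid_keys]


# Hypothetical fact rows and dimension keys.
order_items = [
    {"order_id": "O1", "account_id": "ABC1"},
    {"order_id": "O2", "account_id": None},
    {"order_id": "O3", "account_id": "XYZ9"},
]
known_accounts = {"ABC1", "DEF2"}

nulls_ok = null_check(order_items, "account_id")
orphans = referential_check(order_items, "account_id", known_accounts)
```

In a setup like the one described, checks of this kind would run once the fact table finishes loading, and a failure would notify the rule's subscribers.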
  38. Org Queues
     Why Org Queues? Introduce fairness in allocation of the Data Platform’s compute resources. Optimize usage of an already overloaded cluster, ensuring rogue jobs are preempted.
     Features of Org Queues
     ● Guaranteed minimum compute
     ● Burstability & pre-emption
     ● Sub-queues of different sizes to improve reliability of P0 jobs
     ● Org admins manage the users & jobs in the queue
  39. Summary
     Challenges @ Scale
     ● Overloaded cluster @ constant capacity
     ● Batch processing patterns
     ● Data quality issues
     ● Missing guard-rails
     Features & Optimizations
     ● FStream
     ● Dr. Elephant - Job Analysis
     ● Tez - Compute Engine
     ● ORC - Storage Format
     Data Governance
     ● Dependency Lineage
     ● Schema Tightening
     ● DQ Asserts
     ● Org Queues
  40. Q & A. “Without big data, you are blind and deaf and in the middle of Outer Ring Road.”
  41. THANKS
