Flipkart Data Platform @ Scale
Arya Ketan, Rishabh Dua
Engineers @ Flipkart Tech
In God we trust. All others must bring data!
Flipkart confidential - For Internal use only. Not to be shared externally.
Agenda
1. Data @ Flipkart
2. Data platform architecture
3. Challenges @ Scale
4. Operating
5. Storage & Compute Optimizations
6. Data Governance
Data @ Flipkart
Who are the users?
“Torture the data, and it will confess to anything.”
Big Data - no longer just a buzzword
● 80% of data is less than 2 years old
● 15+ PB of HDFS files
● 3 billion+ events ingested daily
● 400 billion+ container hours daily
● 30+ TB ingested daily
Data Platform Architecture
Architecture
Challenges @ Scale
Operating ● Predictability ● Reliability
Operating data platform
Challenges in batch processing
Classic Batch pattern
● Fixed window cycles
● Repeated every window
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Challenges in batch processing
● Breaks down when used with sophisticated window strategies (e.g. sessions)
● Businesses crave more timely data
● Uneven workload spread
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Stream processing patterns
● Stream
○ Low latency but approximate results
○ Unordered data of varying event-time skew
● Event time: the time at which events actually occurred
● Processing time: the time at which events are observed in the system
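The difference between the two notions of time can be sketched with a small fixed-window count. This is an illustration only: the window size, the field names `event_time` and `processing_time`, and the sample events are assumptions, not part of the platform.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains timestamp ts."""
    return ts - (ts % size)


def count_by_window(events, time_field):
    """Count events per fixed window, keyed on the chosen time field."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(counts)


# Each event carries both timestamps (in seconds). The third event occurred
# at t=50 but was observed late, at t=130, so the two groupings disagree.
events = [
    {"event_time": 10, "processing_time": 12},
    {"event_time": 70, "processing_time": 75},
    {"event_time": 50, "processing_time": 130},
]

by_event_time = count_by_window(events, "event_time")
by_processing_time = count_by_window(events, "processing_time")
```

Grouping by event time puts the late event back into the window where it actually occurred; grouping by processing time places it in whichever window was open when it arrived.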
Lambda Architecture
Semantics for unbounded data
● Time-agnostic
● Approximation
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Semantics for unbounded data
● Windowing by processing time
● Windowing by event time
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Batch to fStream
● Streaming applications
○ f-SQL (ANSI SQL compliant)
Batch to fStream
● Streaming applications
○ Materialized time windows: time-partitioned aggregates in HBase
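One way to picture time-partitioned aggregates is a row key that embeds the window start, so each window materializes as its own row. The sketch below is purely hypothetical: the key layout, the 5-minute window, and the `record` helper are assumptions, and a plain dict stands in for the HBase table.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # hypothetical 5-minute window


def row_key(metric, event_time, size=WINDOW_SECONDS):
    """Build a row key that partitions aggregates by time window."""
    start = event_time - (event_time % size)
    # Zero-padding keeps keys for one metric sorted by window start.
    return f"{metric}#{start:010d}"


# A plain dict stands in for the HBase table in this sketch.
table = defaultdict(int)


def record(metric, event_time, value=1):
    """Add value to the aggregate of the window containing event_time."""
    table[row_key(metric, event_time)] += value


record("orders", 100)
record("orders", 250)
record("orders", 400)
```

Because the window start is part of the key, a late event simply increments the row of the window it belongs to.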
Improvements
● Fresher data at lower latency
○ User Insight prediction
○ Trust and Safety Interventions
● Newer features for data-science
○ User sessionization
● Lower resource consumption
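User sessionization, listed above, is commonly done by splitting a user's events wherever the gap between consecutive events exceeds an inactivity timeout. A minimal sketch, assuming timestamps in seconds and a hypothetical 30-minute timeout (the deck does not state how fStream defines a session):

```python
def sessionize(timestamps, gap=1800):
    """Group event timestamps (seconds) into sessions split by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the event is close enough to the last one.
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Events may arrive out of order; sorting restores event-time order first.
clicks = [0, 600, 5000, 900, 5100]
user_sessions = sessionize(clicks)
```

Session windows have no fixed length, which is why they fit poorly into fixed batch cycles.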
Optimizing the data platform to improve predictability
Overload @ constant capacity
● More users, more use cases, more jobs, more resources
○ 100x increase in compute hours
● Hardware is not available in the data center to scale at the same rate
○ 1.1x increase in machine instances
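Taking the two growth figures above at face value, a quick calculation shows how much more compute each machine instance has to absorb. The variable names are illustrative.

```python
compute_growth = 100.0  # from the slide: 100x increase in compute hours
machine_growth = 1.1    # from the slide: 1.1x increase in machine instances

# Growth in compute hours demanded per machine instance.
load_per_machine_growth = compute_growth / machine_growth
```

The ratio comes to roughly 91x per machine, which is the gap the optimizations in the following slides have to close.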
Job analysis
● Problems in jobs are not obvious
● Many possible configurations: Hive, Hadoop, HDFS, Spark, JVM
● Inter-related settings
● Information & metrics are scattered
Optimizing compute usage
Dr. Elephant to the rescue (http://github.com/linkedin/dr-elephant/)
● Automated performance monitoring and tuning tool
● Indicates best practices and tuning tips
● Best performance for every job
Dr Elephant Dashboard
Dr Elephant - Heuristics & Severity
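The idea of a heuristic mapped to a severity can be sketched as a metric compared against ascending thresholds. The level names follow Dr. Elephant's severity scale; the skew rule, the threshold values, and the function names below are hypothetical and are not taken from Dr. Elephant's code.

```python
from statistics import mean

SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def severity_ascending(value, thresholds):
    """Return the highest severity whose threshold the value meets.

    thresholds holds four ascending cut-offs for LOW, MODERATE, SEVERE, CRITICAL.
    """
    level = 0
    for i, limit in enumerate(thresholds, start=1):
        if value >= limit:
            level = i
    return SEVERITIES[level]


def data_skew_severity(task_input_bytes, thresholds=(2, 4, 8, 16)):
    """Hypothetical skew heuristic: largest task input relative to the mean."""
    avg = mean(task_input_bytes)
    ratio = max(task_input_bytes) / avg if avg else 0.0
    return severity_ascending(ratio, thresholds)


balanced = data_skew_severity([100, 110, 95, 105])
skewed = data_skew_severity([100] * 9 + [5000])
```

A job whose tasks read similar amounts of data scores NONE, while one straggler task reading far more than the rest pushes the job into a higher severity.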
Optimizing compute - Tez vs MapReduce
● Tez creates a DAG of tasks. Compared to MR:
○ No intermediate data written to HDFS between stages
○ Larger memory footprint
● No one size fits all
● Assigner chooses compute engine
○ Container hours
○ Resources used
○ Configuration tweaking
Flow: job to be scheduled -> Assigner -> compute engine chosen (Tez or MR)
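The Assigner step can be pictured as a small rule-based function that looks at a job's profile and returns an engine. The deck does not describe the actual rules, so the `JobProfile` fields, the memory budget, and both rules below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    name: str
    peak_memory_gb: float  # hypothetical estimate from past runs
    stages: int            # number of chained stages in the query plan


def choose_engine(job, memory_budget_gb=8.0):
    """Pick "TEZ" or "MR" for a job using two illustrative rules."""
    # Tez's larger memory footprint can hurt jobs that exceed the budget.
    if job.peak_memory_gb > memory_budget_gb:
        return "MR"
    # Multi-stage jobs gain the most from skipping intermediate writes.
    if job.stages > 1:
        return "TEZ"
    return "MR"


light_join = choose_engine(JobProfile("light_join", 2.0, 4))
heavy_join = choose_engine(JobProfile("heavy_join", 32.0, 4))
```

A real assigner would presumably also weigh the container hours and resources used by past runs, as the bullets above suggest.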
Optimizing storage
Many storage formats: JSON, Avro, ORC
Which storage format?
ORC vs Avro vs Parquet vs JSON
● ORC / Parquet score over Avro / JSON
○ Encoding, dictionaries, indexes, projection pushdown, predicate pushdown
● Choose Parquet for highly nested structures.
○ Note: We are working on a feature in ORC + Hive to support predicate pushdown and projection pushdown.
Optimized storage format
● Columnar format
● Integrated compression, indexes and stats
● Predicate push down & Projection push down
● Run-length encoding
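Two of the techniques above can be shown in miniature: run-length encoding of a repetitive column, and predicate pushdown that uses per-stripe min/max statistics to skip data. This is a simplified sketch, not ORC's actual encoding or reader; the function names and sample data are assumptions.

```python
def rle_encode(values):
    """Run-length encode a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]


def stripes_to_read(stripe_stats, lower_bound):
    """Predicate pushdown for `col >= lower_bound`.

    stripe_stats holds (min, max) per stripe; a stripe whose max is below
    the bound cannot contain a match and is skipped without being read.
    """
    return [i for i, (lo, hi) in enumerate(stripe_stats) if hi >= lower_bound]


column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(column)

stats = [(1, 90), (100, 180), (200, 260)]
needed = stripes_to_read(stats, 150)
```

Projection pushdown is the column-wise counterpart: in a columnar layout, a query that selects two columns never has to read the bytes of the others.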
Improvements
● ORC
○ ~80% savings in storage, ~60% savings in compute
● Dr. Elephant
○ 2000+ jobs improved
○ ~70% savings in compute
● Tez
○ 10-100x improvement in processing speed
Data Governance
With great power comes great responsibility.
- Uncle Ben
Unreliability due to data issues
● What is the source of truth for “Order Item Information”?
-- No way to annotate a data asset as blessed
● Why is this “Id” not in the Data Platform?
-- Referential integrity constraints & validations are not supported
● Why does Account-Id have invalid characters such as “%@#21323213”?
-- The column is an “account id”, not just a String
● Why does my data table have yesterday’s data?
-- RCA of the dependencies is hard
Missing guard-rails & attribution
● Unrestricted usage of data assets in the platform
● No minimum guarantees on compute for job execution
Lineage
● Data Assets Lineage
○ Easier RCA
○ Enables reuse
○ Strategies to improve data quality
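Lineage makes RCA easier because a stale or broken asset can be traced back through everything it is derived from. A minimal sketch, assuming lineage is stored as a mapping from each asset to its direct upstream assets (the asset names and the `UPSTREAM` structure are hypothetical):

```python
from collections import deque

# Hypothetical lineage: each data asset maps to the assets it is derived from.
UPSTREAM = {
    "daily_sales_report": ["order_item_fact"],
    "order_item_fact": ["orders_raw", "items_raw"],
    "orders_raw": [],
    "items_raw": [],
}


def upstream_assets(asset, lineage=UPSTREAM):
    """Return every transitive upstream dependency of asset, nearest first."""
    seen, order = set(), []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # shared dependencies are reported once
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


suspects = upstream_assets("daily_sales_report")
```

When a report shows yesterday's data, the returned list is the set of assets to check, starting with the nearest dependency.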
What is Data Governance?
1. Catalog of data assets: schema & dependency definition
2. Classify and govern these assets: attributes, tagging & security policies
3. Collaboration capabilities around these data assets: ownership, accountability, subscriptions
Schema Tightening
● Why? Identify data issues before they enter the system
Diagram: MicroService1 and MicroService2 send records through Data Platform ingestion. AccountId: ABC21312321333 is accepted into the Data Platform; AccountId: FOO%%1231233 is rejected with ERROR.
Schema Tightening
How?
● Business types, e.g. AccountId, Price, OrderId
● Validations via JSON Schema
● Migrating to schema-tightened entities
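A business type such as AccountId can be expressed as a JSON Schema fragment with a `pattern` constraint and checked at ingestion. The pattern below is a guess inferred from the two sample ids in the deck, and the tiny checker handles only the `type` and `pattern` keywords; a production setup would presumably use a full JSON Schema validator.

```python
import re

# Hypothetical JSON Schema fragment for an AccountId business type.
ACCOUNT_ID_SCHEMA = {"type": "string", "pattern": "^[A-Z]{3}[0-9]+$"}


def is_valid(value, schema):
    """Check a value against the "type" and "pattern" keywords only."""
    if schema.get("type") == "string" and not isinstance(value, str):
        return False
    pattern = schema.get("pattern")
    # In JSON Schema, "pattern" constrains string values only.
    if pattern is not None and isinstance(value, str):
        if re.search(pattern, value) is None:
            return False
    return True


accepted = is_valid("ABC21312321333", ACCOUNT_ID_SCHEMA)
rejected = not is_valid("FOO%%1231233", ACCOUNT_ID_SCHEMA)
```

Rejecting the malformed id at ingestion keeps it from reaching downstream tables, where it would otherwise be indistinguishable from any other string.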
Data Quality Asserts
● Support for multiple constraints, e.g. NULL Check, Variance, Referential Change, Custom Query
● Auto-triggered when a fact is finished
● Anyone can subscribe to an Assert Rule
● Jira & email integration
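Two of the constraint types above can be sketched as plain functions that run once a fact has finished loading. The function names and sample rows are hypothetical, and "Variance" is interpreted here as the statistical variance of a column, which is an assumption about what the deck means.

```python
from statistics import pvariance


def null_check(rows, column):
    """Pass when no row has a missing value in the column."""
    return all(row.get(column) is not None for row in rows)


def variance_check(values, max_variance):
    """Pass when the population variance stays within the allowed bound."""
    return pvariance(values) <= max_variance


def run_asserts(asserts):
    """Run named checks and return the names of those that failed."""
    return [name for name, check in asserts if not check()]


rows = [{"order_id": "O1", "price": 100}, {"order_id": None, "price": 105}]
failed = run_asserts([
    ("order_id_not_null", lambda: null_check(rows, "order_id")),
    ("price_variance", lambda: variance_check([r["price"] for r in rows], 50)),
])
```

The list of failed assert names is what a notification step (for example the Jira and email integration mentioned above) would report to subscribers.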
Org Queues
Why Org Queues?
Introduce fairness in allocation of Data Platform’s compute resources.
Optimize usage of already overloaded cluster, ensuring rogue jobs are preempted.
Features of Org Queue
● Guaranteed Minimum Compute
● Burstability & Pre-emption
● Sub queues of different sizes to improve reliability of P0 jobs
● Org Admins to manage the Users & Jobs in the queue
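Guaranteed minimums, burstability, and pre-emption can be illustrated with a small allocation function. This is a sketch under stated assumptions, not the platform's scheduler: the queue names and numbers are invented, guarantees are assumed to sum to at most the cluster capacity, and spare capacity is lent in proportion to unmet demand.

```python
def allocate(capacity, queues):
    """Split capacity across queues.

    queues maps a queue name to (guaranteed_minimum, demand). Each queue first
    receives min(guarantee, demand); leftover capacity is then lent to queues
    that still have unmet demand, in proportion to that unmet demand.
    """
    alloc = {q: min(g, d) for q, (g, d) in queues.items()}
    spare = capacity - sum(alloc.values())
    unmet = {q: d - alloc[q] for q, (g, d) in queues.items() if d > alloc[q]}
    total_unmet = sum(unmet.values())
    if spare > 0 and total_unmet > 0:
        for q, need in unmet.items():
            alloc[q] += min(need, spare * need / total_unmet)
    return alloc


# "search" is idle below its guarantee, so "ads" can burst into the spare share.
shares = allocate(100, {"search": (40, 20), "ads": (30, 90), "supply": (30, 30)})

# When "search" demand returns, re-running the allocation takes the borrowed
# capacity back, which is the effect pre-emption has on the bursting queue.
after_preemption = allocate(100, {"search": (40, 40), "ads": (30, 90), "supply": (30, 30)})
```

In the first call "ads" runs at 50 units, above its guarantee of 30; in the second it falls back to 30 because every guarantee is in use.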
Summary
Challenges @ Scale
● Overloaded cluster @ constant capacity
● Batch processing patterns
● Data quality issues
● Missing guard-rails
Features & Optimizations
● fStream
● Dr Elephant - Job Analysis
● Tez - Compute Engine
● ORC - Storage Format
Data Governance
● Dependency Lineage
● Schema Tightening
● DQ Asserts
● Org Queues
Q & A
“Without big data, you are blind and deaf and in the middle of Outer Ring Road.”
THANKS

Flipkart Data Platform @ Scale - slash n 2018 reprise

  • 1. Flipkart Data Platform @ Scale Arya Ketan, Rishabh Dua Engineers @ Flipkart Tech In God we trust. All others must bring data!
  • 2. Flipkart confidential - For Internal use only. Not to be shared externally. Agenda 1. Data @ Flipkart 2. Data platform architecture 3. Challenges @ Scale 4. Operating 5. Storage & Compute Optimizations 6. Data Governance
  • 4. Flipkart confidential - For Internal use only. Not to be shared externally. Who are the users? “Torture the data, and it will confess to anything.”
  • 5. Flipkart confidential - For Internal use only. Not to be shared externally. Big Data - no longer just a buzzword 80% DATA < 2 years old 15+ PB HDFS files 3 billion + events ingested daily 400 billion + container hours daily 30+ TB Ingested daily
  • 7. Flipkart confidential - For Internal use only. Not to be shared externally. Architecture
  • 8. Challenges @ Scale Operating ● Predictability ● Reliability
  • 10. Flipkart confidential - For Internal use only. Not to be shared externally. Challenges in batch processing Classic Batch pattern ● Fixed window cycles ● Repeated every window Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • 11. Flipkart confidential - For Internal use only. Not to be shared externally. Challenges in batch processing ● Breaks down when used with sophisticated window strategies Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● Businesses crave more timely data ● Non even workload spreads Session
  • 12. Flipkart confidential - For Internal use only. Not to be shared externally. Stream processing patterns ● Stream ○ Low latency but approximate results ○ Unordered data of varying event-time skew ● Event time : which is the time at which events actually occurred. ● Processing time: which is the time at which events are observed in the system.
  • 13. Flipkart confidential - For Internal use only. Not to be shared externally. Lambda Architecture
  • 14. Flipkart confidential - For Internal use only. Not to be shared externally. Semantics for unbounded data ● Time-agnostic ● Approximation Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • 15. Flipkart confidential - For Internal use only. Not to be shared externally. Semantics for unbounded data Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 Windowing by Processing time Windowing by Event time
  • 16. Flipkart confidential - For Internal use only. Not to be shared externally. Batch to fStream ● Streaming applications ○ f-SQL( ANSI-SQL compliant)
  • 17. Flipkart confidential - For Internal use only. Not to be shared externally. Batch to fStream ● Streaming applications ○ Materialized time windows HBASE Time Partitioned Aggregates
  • 18. Flipkart confidential - For Internal use only. Not to be shared externally. Improvements ● Lower latency of freshness ○ User Insight prediction ○ Trust and Safety Interventions ● Newer features for data-science ○ User sessionization ● Lower resource consumption
  • 19. Optimizing data platform to improve predictability
  • 20. Flipkart confidential - For Internal use only. Not to be shared externally. Overload @ constant capacity ● More users, more use cases, more jobs, more resources ○ 100x Increase in compute hours ● Hardware unavailability in DataCenter to scale at same rate ○ 1.1x increase in machine instances
  • 21. Flipkart confidential - For Internal use only. Not to be shared externally. Job analysis ● Problems in jobs are not obvious ● Lot of possible configurations - Hive, Hadoop, HDFS, Spark, JVM ● Inter-related settings ● Information & metrics are scattered
  • 22. Flipkart confidential - For Internal use only. Not to be shared externally. Optimizing compute usage ● Automated performance monitoring and tuning tool ● Indicates best practices and tuning tips ● Best performance for every job DR Elephant to the rescue http://github.com/linkedin/dr-elephant/
  • 23. Dr Elephant Dashboard
  • 24. Dr Elephant - Heuristics & Severity
  • 25. Optimizing compute - Tez vs MapReduce ● Tez creates a DAG of tasks. Compared to MR: ○ No intermediate data written ○ Larger memory footprint ● No one size fits all ● Assigner chooses the compute engine ○ Container hours ○ Resources used ○ Configuration tweaking [Diagram: job to be scheduled → Job Assigner → compute engine chosen (Tez or MR)]
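A rough sketch of how an assigner along these lines might pick an engine. The talk names container hours and resources used as inputs but does not describe the decision rule, so everything here is an assumption: the `EngineStats` record, the `choose_engine` function, the memory budget, and the fallback to MR are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineStats:
    """Hypothetical per-engine history for one job."""
    container_hours: float  # cost observed on past runs
    peak_memory_gb: float   # memory the job needed on this engine


def choose_engine(history, memory_budget_gb):
    """Pick the cheapest engine whose memory footprint fits the budget.

    history maps an engine name ("TEZ" or "MR") to its EngineStats.
    Falls back to "MR" when nothing fits (an assumption of this sketch).
    """
    fitting = {
        engine: stats
        for engine, stats in history.items()
        if stats.peak_memory_gb <= memory_budget_gb
    }
    if not fitting:
        return "MR"
    return min(fitting, key=lambda engine: fitting[engine].container_hours)


# Hypothetical job: Tez is cheaper in container hours but needs more memory.
job_history = {
    "TEZ": EngineStats(container_hours=2.0, peak_memory_gb=16.0),
    "MR": EngineStats(container_hours=5.0, peak_memory_gb=4.0),
}
```

With a 32 GB budget the sketch picks Tez for this job; with an 8 GB budget it picks MR, which reflects the "no one size fits all" point on the slide.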
  • 26. Optimizing storage ● Many storage formats: JSON, Avro, ORC
  • 27. Which storage format? ORC vs Avro vs Parquet vs JSON ● ORC / Parquet score over Avro / JSON ○ Encoding, dictionaries, indexes, projection pushdown, predicate pushdown ● Choose Parquet for highly nested structures ○ Note: We are working on a feature in ORC + Hive to support predicate pushdown and projection pushdown.
  • 28. Optimized storage format ● Columnar format ● Integrated compression, indexes and stats ● Predicate pushdown & projection pushdown ● Run-length encoding
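Two of the ideas on slide 28 can be shown in a few lines. This is a toy illustration, not ORC's actual encoder or reader: `rle_encode`, `rle_decode`, and `scan_greater_than` are hypothetical helpers, and the per-stripe min/max statistics are computed on the fly here, whereas a real columnar file would store them alongside the data.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in pairs for _ in range(length)]


def scan_greater_than(stripes, threshold):
    """Return values > threshold, skipping stripes whose max rules them out.

    Returns (matching_values, number_of_stripes_actually_read).
    """
    stats = [(min(stripe), max(stripe)) for stripe in stripes]
    hits, stripes_read = [], 0
    for stripe, (_, highest) in zip(stripes, stats):
        if highest <= threshold:
            continue  # predicate pushdown: no row in this stripe can match
        stripes_read += 1
        hits.extend(v for v in stripe if v > threshold)
    return hits, stripes_read


country_column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(country_column)

price_stripes = [[1, 2, 3], [10, 20, 30], [4, 5, 6]]
matches, stripes_read = scan_greater_than(price_stripes, 8)
```

Run-length encoding pays off in a columnar layout because values of the same column sit next to each other, and the min/max statistics let a reader skip whole stripes without decoding them.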
  • 29. Improvements ● ORC ○ ~80% savings in storage, ~60% savings in compute ● Dr. Elephant ○ 2000+ jobs improved ○ ~70% savings in compute ● Tez ○ 10-100x improvement in processing speed
  • 30. Data Governance With great power comes great responsibility. - Uncle Ben
  • 31. Unreliability due to data issues ● What is the source of truth for “Order Item Information”? -- No way to annotate a data asset as blessed ● Why is this “Id” not in the Data Platform? -- Referential integrity constraints & validations are not supported ● Why does Account-Id have invalid characters like “%@#21323213”? -- The column is an “account id”, not just a String ● Why does my data table have yesterday’s data? -- RCA of the dependencies is hard
  • 32. Missing guard-rails & attribution ● Unrestricted usage of data assets in the platform ● No minimum guarantees on compute for job execution
  • 33. Lineage ● Data asset lineage ○ Easier RCA ○ Enables reuse ○ Strategies to improve data quality
  • 34. What is Data Governance? 1. Catalog of data assets: schema & dependency definition 2. Classify and govern these assets: attributes, tagging & security policies 3. Collaboration capabilities around these data assets: ownership, accountability, subscriptions
  • 35. Schema Tightening ● Why? Identify data issues before they enter the system [Diagram: MicroService1 and MicroService2 send records to Data Platform ingestion; AccountId: ABC21312321333 is accepted into the Data Platform, AccountId: FOO%%1231233 is rejected with ERROR]
  • 36. Schema Tightening ● How? ○ Business types, e.g. AccountId, Price, OrderId ○ Validations via JSON Schema ○ Migrating to schema-tightened entities
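A minimal sketch of validating a business type at ingestion. The talk says validation is done via JSON Schema; to stay dependency-free this sketch uses a plain regular expression of the kind a JSON Schema `pattern` keyword would carry. The patterns for `AccountId` and `OrderId`, the `BUSINESS_TYPES` table, and the `validate` function are all assumptions, chosen only so that the two sample values from slide 35 behave as the diagram shows.

```python
import re

# Hypothetical business types; the real formats are not given in the talk.
BUSINESS_TYPES = {
    "AccountId": re.compile(r"[A-Z]{3}[0-9]+"),
    "OrderId": re.compile(r"OD[0-9]+"),
}


def validate(record, schema):
    """Check each field of record against its declared business type.

    schema maps a field name to a business type name.
    Returns a list of error messages; an empty list means the record passes.
    """
    errors = []
    for field, business_type in schema.items():
        value = record.get(field)
        pattern = BUSINESS_TYPES[business_type]
        if not isinstance(value, str) or not pattern.fullmatch(value):
            errors.append(f"{field}: {value!r} is not a valid {business_type}")
    return errors


schema = {"accountId": "AccountId"}
accepted = validate({"accountId": "ABC21312321333"}, schema)
rejected = validate({"accountId": "FOO%%1231233"}, schema)
```

Typing the column as an `AccountId` rather than a plain string is what lets the bad record be rejected at the ingestion boundary instead of surfacing later in a report.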
  • 37. Data Quality Asserts ● Multiple constraints supported, e.g. NULL check, variance, referential change, custom query ● Auto-triggered when a fact is finished ● Anyone can subscribe to an assert rule ● Jira & email integration
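Two of the listed constraints could look roughly like this. The talk does not define them, so this sketch makes assumptions: the NULL check requires a column to be non-null in every row, and the variance check is read as "today's row count must not differ from yesterday's by more than a threshold". The function names, the 20% default, and the sample rows are hypothetical.

```python
def null_check(rows, column):
    """Pass only if column is present and non-null in every row."""
    return all(row.get(column) is not None for row in rows)


def variance_check(today_count, yesterday_count, max_change=0.2):
    """Pass if the relative change in row count stays within max_change."""
    if yesterday_count == 0:
        return today_count == 0
    change = abs(today_count - yesterday_count) / yesterday_count
    return change <= max_change


def run_asserts(asserts):
    """Run named checks and return the names of those that failed.

    asserts maps an assert name to a zero-argument callable returning bool.
    The failed names are what subscribers would be notified about.
    """
    return [name for name, check in asserts.items() if not check()]


rows = [{"order_id": "OD1", "price": 100}, {"order_id": "OD2", "price": None}]
failed = run_asserts({
    "order_id_not_null": lambda: null_check(rows, "order_id"),
    "price_not_null": lambda: null_check(rows, "price"),
    "row_count_stable": lambda: variance_check(110, 100),
})
```

In a setup like the one described, `run_asserts` would be triggered when a fact finishes loading, and the failed names would feed the Jira and email integration.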
  • 38. Org Queues ● Why Org Queues? ○ Introduce fairness in allocation of the Data Platform’s compute resources ○ Optimize usage of an already overloaded cluster, ensuring rogue jobs are preempted ● Features of Org Queues ○ Guaranteed minimum compute ○ Burstability & pre-emption ○ Sub-queues of different sizes to improve reliability of P0 jobs ○ Org admins manage the users & jobs in the queue
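The interplay of guaranteed minimum, burst, and pre-emption can be sketched as a simple allocation function. This is an illustration under stated assumptions, not the scheduler described in the talk: the `allocate` function, the queue names, and the rule for handing out spare capacity (alphabetical order) are hypothetical, and the guarantees are assumed to sum to no more than the cluster capacity.

```python
def allocate(capacity, queues):
    """Split capacity across org queues.

    queues maps a queue name to (guaranteed, demand).
    Each queue first receives min(guaranteed, demand); any spare capacity
    is then lent to queues with unmet demand (burst). Recomputing after a
    demand change takes lent capacity back, which models pre-emption.
    """
    allocation = {
        name: min(guaranteed, demand)
        for name, (guaranteed, demand) in queues.items()
    }
    spare = capacity - sum(allocation.values())
    for name, (_, demand) in sorted(queues.items()):
        if spare <= 0:
            break
        extra = min(demand - allocation[name], spare)
        allocation[name] += extra
        spare -= extra
    return allocation


# Hypothetical orgs: "ads" is mostly idle, so "search" can burst.
idle_ads = allocate(100, {"ads": (40, 10), "search": (30, 80), "supply": (30, 30)})

# When "ads" demand returns, "search" is cut back to its guarantee.
busy_ads = allocate(100, {"ads": (40, 40), "search": (30, 80), "supply": (30, 30)})
```

In the first call `search` bursts to 60 units using capacity `ads` is not using; in the second, every queue is back at its guaranteed minimum.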
  • 39. Summary ● Challenges @ Scale ○ Overloaded cluster @ constant capacity ○ Batch processing patterns ○ Data quality issues ○ Missing guard-rails ● Features & Optimizations ○ fStream ○ Dr Elephant - Job analysis ○ Tez - Compute engine ○ ORC - Storage format ● Data Governance ○ Dependency lineage ○ Schema tightening ○ DQ asserts ○ Org queues
  • 40. Q & A “Without big data, you are blind and deaf and in the middle of Outer Ring Road.”
  • 41. THANKS