Things might go wrong in a data-intensive application
Petertc Chu | PyConline AU 2021
Scope
Applications deal with huge volumes of data
- Web applications, mobile apps, IoT...
Challenges
- “the quantity of data, the complexity of data, the speed at which it is changing”
Key factors
- Scalability, Reliability
(dataintensive.net)
About me
Research engineer and Pythonista from Taiwan
Working on data infrastructures for ten years
kiwislife.com
The case
Host and manage UGC (User-generated content) with various usage patterns
- Streaming, IoT data aggregation, file distribution, archiving...
- ~10PiB raw capacity
- Processing several TiBs per day
We can cover a football field if we put all our disks on the ground
Concepts
- Structured data store: sharding / partitioning, RDBMS clusters, NoSQL...
- Cache layer
- Unstructured data store: various kinds of DFSs, heterogeneous storage media
- Application servers
- Job processing systems, other subsystems
- Various usage patterns
Incident #1
What happened?
Thousands of IoT devices push data to our cluster 24/7/365 and got
- Error rate: ~30%
- Avg RTT: 39.005s
The build up
DB race condition
- Optimistic locking doesn’t help in this pattern (W >> R)
[Diagram: IoT devices, application servers, databases; contention occurred! 😱]
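Why optimistic locking struggles when writes dominate can be sketched as follows. This is a minimal in-memory illustration, not the system from the talk: `Row`, `StaleWrite`, and `compare_and_set` are hypothetical names, and a real database would do the version check inside an UPDATE statement.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int = 0
    version: int = 0  # bumped on every successful write


class StaleWrite(Exception):
    """Raised when the row changed after the writer read it."""


def compare_and_set(row: Row, expected_version: int, new_value: int) -> None:
    # Optimistic locking: apply the write only if nobody else wrote first.
    if row.version != expected_version:
        raise StaleWrite(f"expected v{expected_version}, found v{row.version}")
    row.value = new_value
    row.version += 1


row = Row()
# Two writers read the same version, then both try to write.
seen_by_a = row.version
seen_by_b = row.version
compare_and_set(row, seen_by_a, 10)  # first writer wins
try:
    compare_and_set(row, seen_by_b, 20)  # second writer must retry
    conflict = False
except StaleWrite:
    conflict = True
```

With thousands of devices writing to the same rows, most attempts end up in the retry branch, which fits the high error rate and long round trips reported above.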
The build up
Pessimistic locking is too expensive for other usage patterns
[Diagram: IoT devices, application servers, databases; with global locking implemented, IoT devices are fine 👍 but other users are stuck waiting 😡]
The build up
Final: a hybrid / adaptive approach
- Only do pessimistic locking for specific operations
- Lock locally by default
- Switch to global locking for a specific resource automatically when a collision is detected
- (switch back after a certain duration)
- Keep using optimistic locking otherwise
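The switching rule in this list can be sketched as below. It is a rough single-process illustration under several assumptions: `AdaptiveLock`, `report_collision`, and the 30-second window are hypothetical, and a plain `threading.Lock` stands in for a cluster-wide lock, which in practice would live in a coordination service and be scoped per resource.

```python
import threading
import time
from contextlib import contextmanager


class AdaptiveLock:
    """Local lock by default; global lock for resources that saw a collision."""

    def __init__(self, global_lock, escalation_seconds=30.0, clock=time.monotonic):
        self._global_lock = global_lock   # stand-in for a cluster-wide lock
        self._local_locks = {}            # resource id -> per-process lock
        self._escalated_until = {}        # resource id -> expiry timestamp
        self._guard = threading.Lock()    # protects the two dicts above
        self._escalation_seconds = escalation_seconds
        self._clock = clock

    def report_collision(self, resource):
        # Called when a collision is detected: escalate this resource only.
        with self._guard:
            self._escalated_until[resource] = self._clock() + self._escalation_seconds

    def mode(self, resource):
        with self._guard:
            expiry = self._escalated_until.get(resource)
            if expiry is not None and self._clock() < expiry:
                return "global"
            # Escalation window is over (or never started): switch back.
            self._escalated_until.pop(resource, None)
            return "local"

    @contextmanager
    def hold(self, resource):
        if self.mode(resource) == "global":
            lock = self._global_lock
        else:
            with self._guard:
                lock = self._local_locks.setdefault(resource, threading.Lock())
        with lock:
            yield


# Usage with a fake clock so the switch back is easy to see.
now = [0.0]
lock = AdaptiveLock(threading.Lock(), escalation_seconds=30.0, clock=lambda: now[0])
modes = [lock.mode("device-42")]
lock.report_collision("device-42")
modes.append(lock.mode("device-42"))
now[0] = 31.0
modes.append(lock.mode("device-42"))
with lock.hold("device-42"):
    held = True
```

Only the resource that collided pays for global locking, and only for a limited time, so other usage patterns keep the cheaper path.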
The build up
Final: a hybrid / adaptive approach
[Diagram: IoT devices, application servers, databases; each application server holds a local lock, with a global lock used only when needed; IoT devices and other users 👍]
Root cause #scalability
We didn’t design for a usage pattern and workload like that
Action taken
- Test concurrency scenarios before each release
- Introduce observability and proactive monitoring systems for quick incident detection and diagnosis
Incident #2
What happened?
We have an advanced data management feature
- Not production ready, just a prototype
- No one had used it for several years
One day, a user discovered it and made a million
times more requests to this subsystem!!
The build up
We needed some kind of distributed solution to handle this.
- resque: a Redis-backed framework for creating background jobs
https://github.blog/2009-11-03-introducing-resque/ https://gist.github.com/defunkt/225369
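The idea behind a background-job framework like this is that producers enqueue job descriptions while independent workers pull and run them. A minimal in-process sketch, assuming `queue.Queue` and threads as stand-ins for Redis and separate worker processes (`run_workers` is a hypothetical helper, not part of resque):

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=4):
    """Run handler(job) for every job using a small pool of worker threads."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    results_guard = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            outcome = handler(job)
            with results_guard:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


processed = run_workers(range(10), lambda n: n * n)
```

Results arrive in whatever order the workers finish. With a shared external queue, the same shape lets capacity grow by adding workers on more machines.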
Root cause #scalability
Load exceeds expectations
Action taken
- All batch processing subsystems are now implemented in a distributed way
Incident #3
What happened?
A supplier built a data protection subsystem for us
...after we deployed it...
Users complained about data corruption!!
The build up
Defective padding in the encryption process
Example 1:
Input data: “DD” * 12
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD 04 04 04 04 |
Example 2:
Input data: “DD” * 16
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
| 10 10 10 10 10 10 10 10 | 10 10 10 10 10 10 10 10 |
Incorrect result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
(If the length of the original data is an integer multiple of the block size B, then an extra block of B bytes, each with value B, is added. B is 16 in this case, which is 10 in hex.)
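The padding rule above can be sketched in a few lines. `pad` follows the rule as stated; `defective_pad` is only a guess at the kind of bug involved (skipping the extra block for block-aligned input), not the supplier's actual code.

```python
BLOCK_SIZE = 16


def pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Always append 1..block_size bytes, each equal to the pad length."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def unpad(padded: bytes) -> bytes:
    """Strip as many bytes as the last byte says."""
    return padded[:-padded[-1]]


def defective_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Assumed bug: no extra block when the input is already block-aligned."""
    pad_len = (-len(data)) % block_size
    return data + bytes([pad_len]) * pad_len


short = b"\xdd" * 12
aligned = b"\xdd" * 16

ok_short = unpad(pad(short)) == short                    # 4 bytes of 0x04 added
ok_aligned = unpad(pad(aligned)) == aligned              # full block of 0x10 added
broken_aligned = unpad(defective_pad(aligned)) == aligned
```

With the assumed defect, unpadding reads the last data byte (0xDD) as a pad length and strips real data, so only block-aligned inputs are corrupted while every other size round-trips correctly.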
The build up
Design a process to fix all affected data
- List all affected records from DBs
- Read corresponding data with an “incorrect” decryption algorithm
- Write data back with a correct encryption algorithm
Id | Size              | Encryption method       | Version number    | Data reference key
1  | 32                | (Not encrypted)         | 0                 | aaa
2  | 6                 | Non-defective algorithm | 0                 | bbb
3  | 5 (not affected)  | Defective algorithm     | 0                 | ccc
4  | 32 (affected)     | Defective algorithm     | 1 (fixed)         | ddd
5  | 64 (affected)     | Defective algorithm     | 0 (not yet fixed) | eee
Only the last one needs a fix (block size = 16)
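The selection rule implied by this table can be written as a small filter. The `Record` fields and the method labels ("none", "correct", "defective") are assumptions made for illustration; the sample rows mirror the five records listed above.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16


@dataclass
class Record:
    id: int
    size: int
    method: str   # "none", "correct", or "defective" (labels are assumptions)
    version: int  # 0 = as originally written, 1 = already rewritten
    key: str


def needs_fix(record: Record) -> bool:
    """Affected: defective algorithm, not yet fixed, size a multiple of the block size."""
    return (
        record.method == "defective"
        and record.version == 0
        and record.size % BLOCK_SIZE == 0
    )


records = [
    Record(1, 32, "none", 0, "aaa"),
    Record(2, 6, "correct", 0, "bbb"),
    Record(3, 5, "defective", 0, "ccc"),
    Record(4, 32, "defective", 1, "ddd"),
    Record(5, 64, "defective", 0, "eee"),
]
to_fix = [record.id for record in records if needs_fix(record)]
```

Bumping the version number after each rewrite makes the job safe to restart: records that were already fixed are skipped on the next pass.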
The build up
Just a silly bug, if it didn’t affect…
- Millions of user records
We set up a job processing system to correct all affected data in our system
gearman [Gearman Job Server] https://github.com/Yelp/python-gearman
Root cause #reliability #softwareFaults
1. Unreliable solution provider
2. Less than a 1% chance of finding the bug by testing
Action taken
- Not outsourcing anymore
- More comprehensive tests with various kinds of scenarios
- ~10 TiB test dataset
Incident #4
What happened?
To keep reliability, we
- Replicate user data multiple times
- Distribute replicas to different failure domains
(different host/data center)
Data was still lost!!
http://dx.doi.org/10.6861/tanet.201810.0398
The build up
Our system balances load by writing data to nodes that have more free resources
- A newly added node generally has more free resources
- As a result, data tends to be placed on new nodes
Data is written to unreliable, newly added nodes and lost, even though the replicas are distributed across different failure domains.
Topic: Electronic/Electrical Reliability (cmu.edu)
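How a capacity-based placement rule concentrates replicas on new hardware can be shown with a toy example. The node names and free-capacity numbers are made up for illustration, each node is assumed to sit in its own failure domain, and `pick_nodes` is a hypothetical helper rather than the real balancer.

```python
def pick_nodes(free_capacity, replicas):
    """Return the `replicas` nodes with the most free capacity."""
    ranked = sorted(free_capacity, key=free_capacity.get, reverse=True)
    return ranked[:replicas]


# Illustrative numbers only: three older nodes plus three freshly added ones.
free = {
    "old-1": 20,
    "old-2": 25,
    "old-3": 15,
    "new-1": 100,
    "new-2": 100,
    "new-3": 100,
}
chosen = pick_nodes(free, replicas=3)
```

All three replicas land on the newly added nodes, so separating failure domains does not help if new hardware is the unreliable part.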
Root cause #reliability #hardwareFaults
It’s hard to prevent data loss completely
- Modeling or simulation cannot truly reflect real-world situations
Action taken
- Do more stability tests on newly added nodes
- Add a batch of new nodes each time, so there is less chance of writing data to an unreliable node
http://dx.doi.org/10.6861/tanet.201810.0398
What do we learn from these incidents? 🤔
#1 “There is unfortunately no easy fix for making applications reliable, scalable”
- No way to enumerate all possible causes of unreliability (hardware faults, software faults, human errors)
- Usage patterns and load keep changing as your business expands, so you cannot have an ultimate scalability design beforehand
#2 Before trying to build a faultless architecture, think twice
- Consider maintainability
- We need a team to sustain a large-scale system, not just a talented engineer
(dataintensive.net)
#3 Service = human beings + machines
Thank you! 🙏🙏🙏
@petertc_chu

