What does it take to scale a system? We'll learn how going distributed can pay dividends in areas like availability and fault tolerance by examining a real-world case study. However, we will also look at the inherent pitfalls. When it comes to distributed systems, for every promise there is a peril.
The Economics of Scale: Promises and Perils of Going Distributed
1. The Economics of Scale: Promises and Perils of Going Distributed
Tyler Treat, Workiva
September 19, 2015
2. About The Speaker
• Backend engineer at Workiva
• Messaging platform tech lead
• Distributed systems
• bravenewgeek.com @tyler_treat
tyler.treat@workiva.com
3. About The Talk
• Why distributed systems?
• Case study
• Advantages/Disadvantages
• Strategies for scaling and resilience patterns
• Scaling Workiva
5. Scale Up vs. Scale Out
Vertical Scaling
❖ Add resources to a node
❖ Increases node capacity, load is unaffected
❖ System complexity unaffected
Horizontal Scaling
❖ Add nodes to a cluster
❖ Decreases load, capacity is unaffected
❖ Availability and throughput w/ increased complexity
22. Partition tweets into different databases using
some consistent hash scheme
(put a hash ring on it).
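A minimal sketch of the hash-ring idea, assuming tweets are keyed by a string ID and each database is identified by a name. The `HashRing` class, the `vnodes` parameter, and the `db0`-style names are illustrative and not from the talk.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a point on the ring (MD5 is used for spread, not security)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owners = {}     # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many points on the ring so load spreads evenly.
        for i in range(self._vnodes):
            pos = _position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = node

    def node_for(self, key):
        if not self._positions:
            raise ValueError("ring is empty")
        # First ring position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect(self._positions, _position(key)) % len(self._positions)
        return self._owners[self._positions[idx]]


ring = HashRing(["db0", "db1", "db2"])
shard = ring.node_for("tweet:12345")  # always the same database for this key
```

The property that makes this attractive for partitioning is that adding a database only moves the keys that land on the new node; keys owned by the existing databases otherwise stay put.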
24. This alleviates lock contention and improves
throughput…
but fetching timelines is still extremely costly
(now scatter-gather query across multiple DBs).
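To illustrate why the read stays costly, here is a rough sketch of a scatter-gather timeline fetch, assuming each shard is an in-memory list of tweet dicts and that larger IDs are newer. The function name and data shapes are hypothetical.

```python
import heapq
import itertools


def fetch_timeline_scatter_gather(shards, followees, limit=20):
    """Ask every shard for matching tweets, then merge the partial results."""
    partials = []
    for shard in shards:  # scatter: no shard can be skipped
        hits = [t for t in shard if t["author"] in followees]
        hits.sort(key=lambda t: t["id"], reverse=True)  # newest first
        partials.append(hits[:limit])
    # gather: merge the per-shard results, which are already sorted newest first
    merged = heapq.merge(*partials, key=lambda t: t["id"], reverse=True)
    return list(itertools.islice(merged, limit))
```

Every timeline read touches every database, so read latency is bounded by the slowest shard and read load grows with the number of shards.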
25. Observation: Twitter is a consumption mechanism
more than an ingestion one…
i.e., cheap reads matter more than cheap writes
27. Ingestion/Fan-Out Process
1. Tweet comes in
2. Query the social graph service for followers
3. Iterate through each follower and insert tweet ID into
their timeline (stored in Redis)
4. Store tweet on disk (MySQL)
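The four steps above can be sketched as follows. Plain Python containers stand in for the social graph service, Redis, and MySQL, and the `TIMELINE_LIMIT` cap is an assumed value, not a figure from the talk.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # assumed cap on cached timeline length

followers = defaultdict(set)  # stand-in for the social graph service
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # stand-in for Redis
tweet_store = {}  # stand-in for MySQL


def ingest(tweet_id, author, text):
    """Fan a new tweet out to every follower's timeline, then persist it."""
    # Step 2: query the social graph for the author's followers.
    recipients = followers.get(author, set())
    # Step 3: insert the tweet ID (not the tweet body) into each timeline.
    for follower in recipients:
        timelines[follower].appendleft(tweet_id)  # newest first
    # Step 4: store the full tweet durably.
    tweet_store[tweet_id] = {"author": author, "text": text}
    return len(recipients)
```

The work done per tweet is proportional to the author's follower count, which is the price paid for making reads cheap.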
29. Ingestion/Fan-Out Process
• Lots of processing on ingest, no computation on reads
• Redis stores timelines in memory—very fast
• Fetching timeline involves no queries—get timeline from
Redis cache and rehydrate with multi-get on IDs
• If timeline falls out of cache, reconstitute from disk
• O(n) on writes (n = number of followers), O(1) on reads
• http://www.infoq.com/presentations/Twitter-Timeline-Scalability
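A rough sketch of the read path described above, assuming a dict stands in for the Redis timeline cache and another for the on-disk tweet store, and that larger tweet IDs are newer. The function names and the `following` map are hypothetical.

```python
def rebuild_from_disk(user, following, tweet_store, limit=800):
    """Slow path: reconstitute a timeline by scanning stored tweets."""
    followees = following.get(user, set())
    ids = [tid for tid, t in tweet_store.items() if t["author"] in followees]
    return sorted(ids, reverse=True)[:limit]  # newest first


def fetch_timeline(user, cache, following, tweet_store):
    """Fast path: read cached tweet IDs, then rehydrate them in one pass."""
    ids = cache.get(user)
    if ids is None:
        # Timeline fell out of cache: rebuild it and cache the result.
        ids = rebuild_from_disk(user, following, tweet_store)
        cache[user] = ids
    # Rehydrate: the equivalent of a multi-get on the tweet IDs.
    return [tweet_store[tid] for tid in ids if tid in tweet_store]
```

On a cache hit the read does no querying at all, only a lookup by user and a batch fetch by ID.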
30. Key Takeaway: think about your access patterns
and design accordingly.
Optimize for the critical path.
31. Let’s Recap…
• Advantages of single database system:
• Simple!
• Data and invariants are consistent (ACID transactions)
• Disadvantages of single database system:
• Slow
• Doesn’t scale
• Single point of failure
37. Sure, just coordinate things before proceeding…
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
38. Sooo what do you do when Justin Bieber tweets to
his 67 million followers?
40. Coordinating for consistency is expensive when data is distributed
because processes can’t make progress independently.
43. Source: Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design
44. Key Takeaway: strong consistency is slow and
distributed coordination is expensive (in terms of
latency and throughput).
67. Flow-Control Mechanisms
• Rate limit
• Bound queues/buffers
• Backpressure - drop messages on the floor
• Increment stat counters for monitoring/alerting
• Exponential back-off
• Use application-level acks for critical transactions
68. Bounding resource utilization and failing fast
helps maintain predictable performance and
impedes cascading failures.
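As an illustration of a few of the flow-control mechanisms listed above, here is a small sketch of a bounded buffer that drops on overflow and counts what it drops, plus a capped exponential back-off schedule. `BoundedInbox`, `backoff_delays`, and their parameters are hypothetical names chosen for this example.

```python
import queue
from collections import Counter


class BoundedInbox:
    """Bounded buffer that fails fast instead of growing without limit."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self.stats = Counter()  # counters a monitoring system could scrape

    def offer(self, msg):
        """Try to enqueue without blocking; drop the message if the buffer is full."""
        try:
            self._q.put_nowait(msg)
        except queue.Full:
            self.stats["dropped"] += 1
            return False
        self.stats["accepted"] += 1
        return True

    def poll(self):
        """Return the next message, or None if the buffer is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def backoff_delays(base, cap, attempts):
    """Delays that double on each retry, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]
```

A producer that sees `offer` return `False` can retry after the next delay from `backoff_delays`, or give up. Either way, memory use stays bounded and the drop counter makes the overload visible.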