This presentation was given at LinkedIn. It is a collection of guidelines and wisdom for re-thinking how we do engineering for massively scalable systems. Useful for anyone who cares about Big Data, Distributed Computing, Hadoop, and more.
1. How To Make Life Suck Less!
(when building scalable systems)
Bradford Stephens
c: www.DrawnToScaleHQ.com
b: www.roadtofailure.com
t: @lusciouspear
2. About Me
• Founder, Drawn to Scale. Lead Engineer, Visible Technologies
• CS Degree, University of North FL
• Former careers in politics, music, finance, consulting
3. Drawn to Scale
• Building the “Big Data” platform: ingestion, processing, storage, search
• Products coming: Big Log, Big Search (faceted), Big Message...
5. Everything Changes with Big Data
• Bar is set higher: a previously niche field, few standard stacks (like LAMP)
• Even minimal success requires better engineering
6. Scalability Matters
• “Web-Scale” data is unstructured and exponentially interconnected
• Social Media: Catalyst
• All data is important
• Data Size != Business Size
7. The Traditional DB
• Excels with highly structured, normalizable data
• Non-linear scale cost
• More data = fewer features
• Optimized for single-node
• 90% of utility is 5% of capability
8. Ergo, Distributed
• Optimize for the problem at hand; no Swiss Army knife
• Shared-nothing, commodity boxes
• Linear scale cost
9. The State of Things
• The order of priorities has changed from 20 years ago:
• Customer experience is paramount
• Engineers are precious
• Fast I/O is expensive
• Storage is cheap
11. Operations
Moving the Box: box-to-sysadmin ratio from 2:1 to 200:1 to 2000:1
(yes devs, you’ll care about this too)
12. Ops vs. Eng
• Engineers build, Ops manages
• Fixing problems: devs code+automate, ops hire
• Want something fixed? Call devs at 2 AM.
13. Config is Important
• Configuration is not 2nd-class anymore
• Needs to be tackled by Engineers
• New frameworks = months of configuration and experimentation
• Chef is a good start, but...
14. Production = Test
• Surprise! You don’t have a Test environment anymore.
• Test Cost => Prod Cost
• Anything that’s not your data center is an approximation. Switches, cable, power, boxes, etc...
15. You’re Always Testing
• Constantly simulate failures and brownouts of boxes, racks, switches...
• “Canary in the Coal Mine”: run a box and rack at 175% current load.
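The two bullets above can be sketched in a few lines. This is only a minimal illustration, not the speaker's tooling: the `TOPOLOGY` map, the 50/50 split between box and rack failures, and the function names are all assumptions made for the example. It picks a box or a whole rack to fail in a drill, and computes the canary's target load at 175% of current load.

```python
import random

# Hypothetical topology for illustration: rack name -> boxes in that rack.
TOPOLOGY = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
}

def pick_failure(topology, rng):
    """Choose a failure to simulate: one box, or every box in a rack."""
    rack = rng.choice(sorted(topology))
    if rng.random() < 0.5:  # assumed split between rack and box drills
        return ("rack", rack, list(topology[rack]))
    box = rng.choice(topology[rack])
    return ("box", box, [box])

def canary_load(current_load, factor=1.75):
    """Target load for the canary box or rack: 175% of current load."""
    return current_load * factor

if __name__ == "__main__":
    kind, name, affected = pick_failure(TOPOLOGY, random.Random())
    print(f"simulate {kind} failure of {name}: take down {affected}")
    print("canary target (requests/sec):", canary_load(200))
```

Passing in the random generator keeps a drill reproducible when it needs to be replayed with a fixed seed.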
17. Built to Fail
• “It’s working” isn’t binary
• Acting weird? Shoot it.
• Multi-system failure is common: be topology aware
• Avoid false negatives: something’s wrong, you don’t know it, and you lose customer data
• This is empowering!
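One way to read “It’s working” isn’t binary: give each node more than two states and remove anything that is not clearly healthy. The sketch below is an assumption-laden illustration; the `NodeStats` fields, the state names, and both thresholds are hypothetical and would need tuning per system.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile request latency

# Hypothetical thresholds, chosen only for the example.
MAX_ERROR_RATE = 0.01
MAX_P99_MS = 500.0

def classify(stats, heartbeat_ok):
    """Return 'dead', 'weird', or 'healthy' instead of a plain up/down flag."""
    if not heartbeat_ok:
        return "dead"
    if stats.error_rate > MAX_ERROR_RATE or stats.p99_latency_ms > MAX_P99_MS:
        return "weird"
    return "healthy"

def should_shoot(state):
    """Acting weird? Shoot it: pull any node that is not clearly healthy."""
    return state != "healthy"
```

Treating “weird” the same as “dead” trades some capacity for a lower chance of the false negative described above.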
19. This is Hard :(
• Engineering at scale is very different from writing a 3-tier webapp
• Care about garbage collection, election algorithms, data structures, access patterns, etc...
• CS knowledge is required, not a luxury
• DBA/RDBMS skills pretty useless
• CAP is law
20. Not Everything’s a Table
• Structure your data according to how it needs to be used
• Unstructured massive files, graphs, KV-stores
• The more your problem narrows, the easier it is to scale
21. Big Data is BIG
• Imagine your test passes taking hours
• What works at 1.5 TB may fail at 10 MB or 2 TB
• Many tests, simple code
• Soft Delete Only
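“Soft Delete Only” can be illustrated with a tiny in-memory store that marks rows as deleted instead of removing them, so a bad delete can be reversed. This is a sketch under stated assumptions: the `SoftDeleteStore` class and its `deleted_at` field are invented for the example, and a real system would keep the marker in its actual datastore.

```python
import time

class SoftDeleteStore:
    """Key-value store where delete only sets a tombstone timestamp."""

    def __init__(self):
        # key -> {"value": ..., "deleted_at": None or a Unix timestamp}
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = {"value": value, "deleted_at": None}

    def delete(self, key):
        row = self._rows.get(key)
        if row is not None and row["deleted_at"] is None:
            row["deleted_at"] = time.time()  # mark it, keep the data

    def get(self, key):
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            return None  # deleted rows are hidden from normal reads
        return row["value"]

    def undelete(self, key):
        row = self._rows.get(key)
        if row is not None:
            row["deleted_at"] = None
```

Calling `delete("k")` followed by `undelete("k")` makes `get("k")` return the original value again, which is the property that matters when a faulty job has touched terabytes.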
22. “No, I won’t give you a repro”
• Often impossible to repro a bug on demand in a cluster
• Either fix your logging or your bug
• Log everything (we have a product for this!)
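When a bug cannot be reproduced, the logs have to carry enough context to reconstruct it. A minimal sketch with Python's standard `logging` module follows; the `node` and `request_id` field names and the `make_logger` helper are assumptions for illustration, not part of any product mentioned in the talk.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so logs can be searched by field."""

    def format(self, record):
        payload = {
            "ts": record.created,
            "level": record.levelname,
            "msg": record.getMessage(),
            # Hypothetical context fields, supplied through `extra=`.
            "node": getattr(record, "node", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)

def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    buf = io.StringIO()
    log = make_logger("cluster", buf)
    log.info("write failed", extra={"node": "box-17", "request_id": "r-42"})
    print(buf.getvalue().strip())
```

With the node and request id on every line, one request can be traced across boxes after the fact instead of waiting for the failure to happen again.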
23. Avoiding Impedance Mismatch
• High vs. Low Latency vs. Throughput
• A lot of data eventually, or a little now
• MapReduce vs. Sharding/Indexing
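The contrast above can be made concrete with a toy example. It is a simplified sketch, not how Hadoop or any sharded store is implemented: `shard_for` and `map_reduce_count` are hypothetical names, and the CRC32 hash is just one stable choice. The first function answers “a little now” by touching one shard; the second answers “a lot eventually” by scanning every record.

```python
import zlib
from collections import Counter

def shard_for(key, num_shards):
    """Low-latency path: a stable hash routes the key to exactly one shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def map_reduce_count(records):
    """Batch path: scan all records, emit (word, 1) pairs, sum per word."""
    mapped = ((word, 1) for line in records for word in line.split())
    totals = Counter()
    for word, n in mapped:
        totals[word] += n
    return dict(totals)

if __name__ == "__main__":
    print("user-1 lives on shard", shard_for("user-1", 4))
    print(map_reduce_count(["big data", "big search"]))
```

Picking the wrong path for a workload is the mismatch: a full scan for a single lookup wastes the cluster, and a shard lookup cannot answer a question about all the data.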
24. Simple Workflow
[Diagram: a data pipeline. Stage labels: Collect, Unstructured Analysis, Semantic Analysis, Structured Analysis, Store in Hadoop, Store in HBase, Indexing, Load/Pull and Replicate Indexes, Shards, Search. Technologies named: Hadoop, HBase, Lucene, Solr, Katta.]
26. Hiring
• Plan for more engineers, fewer ops staff
• Be aware of “context switch cost” when training RDBMS folks
27. It’s Not Just Coding
• Be aware of research cost
• Much more time is spent experimenting than coding
• Coding all this from scratch is horrific
• Nailing together 10+ OSS projects is a pain
• Open-source anything that isn’t “secret sauce”
28. Solve your Core Problem
• “Making your own electricity doesn’t create better tasting beer”
• Plan to use an end-to-end platform in the future (hint: ours!)
29. In Summary
• Plan for everything to fail
• Test constantly in production
• Systems Software requires Computer Science
• Don’t build it if you don’t have to
30. Thanks!
• Y’all
• Road to Failure Readers
• James Hamilton, Amazon/MS
• Bradford Cross, Flightcaster
• Ryan Rawson, HBase/Stumbleupon