A Brief, Rapid History of Scaling Instagram
(with a tiny team)
Mike Krieger
QConSF 2013
Hello!
Instagram
30 million with 2 eng
(2010-end 2012)
150 million with 6 eng
(2012-now)
How we scaled
What I would have
done differently
What tradeoffs you make when
scaling with that size team
(if you can help it, have a
bigger team)
perfect solutions
survivor bias
decision-making process
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
This talk
Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB
Getting Started
2010
2 guys on a pier
no one <3s it
Focus
Mike iOS, Kevin Server
Early Stack
Django + Apache mod_wsgi
Postgres
Redis
Gearman
Memcached
Nginx
If today
Django + uWSGI
Postgres
Redis
Celery
Memcached
HAProxy
Three months later
Server planning night
before launch
Traction!
Year 1: Scaling Up
scaling.enable()
Single server in LA
infra newcomers
“What’s a load average?”
“Can we get another
server?”
Doritos &
Red Bull &
Animal Crackers &
Amazon EC2
Underwater on recruiting
2 total engineers
Scale "just enough" to get
back to working on app
Every weekend was an
accomplishment
“Infra is what happens when you’re busy
making other plans”
—Ops Lennon
Scaling up DB
First bottleneck: disk IO
on old Amazon EBS
At the time: ~400 IOPS
max
Simple thing first
Vertical partitioning
Django DB Routers
Partitions
Media
Likes
Comments
Everything else
PG Replication to
bootstrap nodes
Bought us some time
Almost no application logic changes
(other than some primary keys)
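A minimal sketch of what this vertical partitioning looks like with a Django DB router, assuming hypothetical app labels and database aliases (not Instagram's actual code):

# settings.py maps each alias to its own Postgres box, e.g.
# DATABASES = {"default": ..., "media": ..., "likes": ..., "comments": ...}
# DATABASE_ROUTERS = ["routers.VerticalPartitionRouter"]

class VerticalPartitionRouter:
    """Send each app's tables to its own database."""

    ROUTES = {"media": "media", "likes": "likes", "comments": "comments"}

    def db_for_read(self, model, **hints):
        return self.ROUTES.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return self.ROUTES.get(model._meta.app_label, "default")

    def allow_relation(self, obj1, obj2, **hints):
        # Keep relations inside a single database alias.
        return self.db_for_read(type(obj1)) == self.db_for_read(type(obj2))

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == self.ROUTES.get(app_label, "default")

Because the routing key is just the app label, the split stays invisible to most application code, which matches the "almost no application logic changes" note above.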
Today: SSD and provisioned
IOPS get you way further
Scaling up Redis
Purely RAM-bound
fork() and COW
Vertical partitioning by
data type
No easy migration story;
mostly double-writing
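A sketch of what double-writing during such a move can look like (client names, hosts, and keys are illustrative, not Instagram's):

import redis

old_redis = redis.Redis(host="redis-old")        # still serving reads
new_redis = redis.Redis(host="redis-followers")  # new instance for one data type

def add_follower(user_id: int, follower_id: int) -> None:
    """Write both copies during the migration; cut reads over once backfilled."""
    key = f"followers:{user_id}"
    old_redis.sadd(key, follower_id)
    new_redis.sadd(key, follower_id)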
Replicating + deleting
often leaves fragmentation
Chaining replication =
awesome
Scaling Memcached
Consistent hashing /
ketama
Mind that hash function
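A toy version of the idea, assuming an md5-based ring in the spirit of ketama (real clients such as libmemcached do this for you; the "mind that hash function" point is that a poorly distributed hash clumps keys onto a few servers):

import bisect
import hashlib

class HashRing:
    """Toy consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        points = []
        for node in nodes:
            for i in range(vnodes):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key: str) -> int:
        # md5 spreads keys evenly; a weak hash here skews load badly
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]

# ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
# ring.get_node("user:12345")  # adding a node only remaps ~1/N of the keys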
Why not Redis for kv
caching?
Slab allocator
Config Management
& Deployment
fabric + parallel git pull
(sorry GitHub)
All AMI based snapshots
for new instances
update_ami.sh
update_update_ami.sh
Should have done Chef
earlier
Munin monitoring
df, CPU, iowait
Ending the year
Infra going from 10% time
to 70%
Focus on client
Testing & monitoring kept
concurrent fires to a minimum
Several ticking time
bombs
Year 2: Scaling Out
App tier
Stateless, but plentiful
HAProxy
(Dead node detection)
Connection limits
everywhere
PGBouncer
Homegrown Redis pool
Hard to track down kernel
panics
Skip rabbit hole; use instance-status to detect and restart
Database Scale Out
Out of IO again
(Pre SSDs)
Biggest mis-step
NoSQL?
Call our friends
and strangers
Theory: partitioning and rebalancing
are hard to get right,
let DB take care of it
MongoDB (1.2 at the
time)
Double write, shadow
reads
Stressing about Primary Key
Placed in prod
Data loss, segfaults
Could have made it
work…
…but it would have been
someone’s full time job
(and we still only had 3
people)
train + rapidly
approaching cliff
Sharding in Postgres
QCon to the rescue
Similar approach to FB
(infra foreshadowing?)
Logical partitioning, done
at application level
Simplest thing; skipped
abstractions & proxies
Pre-split
5000 partitions
note to self: pick a power
of 2 next time
Postgres "schemas"
database → schema → table → columns
machineA:
  shard0 → photos_by_user
  shard1 → photos_by_user
  shard2 → photos_by_user
  shard3 → photos_by_user
machineA’ (replica of machineA):
  shard0 → photos_by_user
  shard1 → photos_by_user
  shard2 → photos_by_user
  shard3 → photos_by_user
Bring up machineA’ via replication, then split the logical shards between the two machines
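A sketch of how the application-level mapping can work, assuming a simple user_id modulo scheme and illustrative names:

NUM_LOGICAL_SHARDS = 5000  # the talk's count; a power of 2 would divide more cleanly

# logical shard -> physical machine; scaling out = repointing entries at a replica
SHARD_TO_HOST = {shard: "machineA" for shard in range(NUM_LOGICAL_SHARDS)}

def shard_for_user(user_id: int) -> int:
    return user_id % NUM_LOGICAL_SHARDS

def photos_query(user_id: int) -> tuple[str, str]:
    """Return (host, SQL) for a user's photos; each logical shard is a Postgres schema."""
    shard = shard_for_user(user_id)
    sql = f"SELECT * FROM shard{shard}.photos_by_user WHERE user_id = %s"
    return SHARD_TO_HOST[shard], sql

Since each logical shard is just a schema, moving load comes down to repointing shard map entries at the promoted replica rather than rewriting data.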
Still how we scale PG
today
9.2 upgrade: bucardo to
move schema by schema
ID generation
Requirements
No extra moving parts
64 bits max
Time ordered
Containing partition key
41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID
10 bits: auto-incrementing sequence, modulo 1024
This means we can generate 1024 IDs, per shard, per table, per millisecond
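A Python sketch of packing an ID with that layout (Instagram's published write-up does this inside a PL/pgSQL function on each shard; the epoch below is arbitrary):

import time

EPOCH_MS = 1293840000000  # arbitrary custom epoch (2011-01-01 UTC); illustrative only

def make_id(shard_id: int, seq: int, now_ms: int | None = None) -> int:
    """64-bit ID: 41 bits of millis | 13 bits of shard | 10 bits of sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = (now_ms - EPOCH_MS) & ((1 << 41) - 1)
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

def shard_of(item_id: int) -> int:
    """The partition key travels inside the ID, so any ID can be routed."""
    return (item_id >> 10) & 0x1FFF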
Lesson learned
A new db is a full time
commitment
Be thrifty with your
existing tech
= minimize moving parts
Scaling configs/host
discovery
ZooKeeper or DNS
server?
No team to maintain
/etc/hosts
ec2tag KnownAs
fab update_etc_hosts
(generates, deploys)
Limited: dead host
failover, etc
But zero additional infra, got
the job done, easy to debug
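A sketch of the generate half, assuming boto3 and the KnownAs tag (the original was a fabric task against the EC2 API of the day, not this code):

import boto3

def known_hosts() -> dict:
    """Map each running instance's KnownAs tag to its private IP (pagination omitted)."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["KnownAs"]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    hosts = {}
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            hosts[tags["KnownAs"]] = inst["PrivateIpAddress"]
    return hosts

def render_etc_hosts(hosts: dict) -> str:
    lines = ["127.0.0.1 localhost"]
    lines += [f"{ip} {name}" for name, ip in sorted(hosts.items())]
    return "\n".join(lines) + "\n"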
Monitoring
Munin: too coarse, too
hard to add new stats
StatsD & Graphite
Simple tech
statsd.timer
statsd.incr
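For context, the one-line interface that made stats this cheap looks roughly like the following (using the common Python statsd client; names are illustrative):

import statsd

stats = statsd.StatsClient("statsd.internal", 8125, prefix="app")

def _write_like(user_id, photo_id):
    pass  # stand-in for the real write path

def like_photo(user_id, photo_id):
    stats.incr("likes.created")            # one-line counter
    with stats.timer("likes.create_ms"):   # one-line timer
        _write_like(user_id, photo_id)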
Step change in developer
attitude towards stats
<5 min from wanting to
measure, to having a graph
580 statsd counters
164 statsd timers
Ending the year
Launched Android
(doubling all of our infra, most of
which was now horizontally scalable)
Doubled active users in
< 6 months
Finally, slowly, building up
team
Year 3+: Stability,
Video, FB
Scale tools to match
team
Deployment &
Config Management
Finally 100% on Chef
Simple thing first: knife
and chef-solo
Every new hire learns
Chef
Code deploys
Many rollouts a day
Continuous integration
But push still needs a
driver
"Ops Lock"
Humans are terrible
distributed locking systems
Sauron
Redis-enforced locks
Rollout / major config changes
/ live deployment tracking
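A minimal sketch of a Redis-enforced ops lock in this spirit (not Sauron itself; key names and TTL are made up):

import uuid
import redis

r = redis.Redis()

def acquire_ops_lock(name: str = "deploy", ttl: int = 600):
    """Try to take the ops lock; returns a token if we got it, else None."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl -- only succeeds if no one else holds the lock
    if r.set(f"opslock:{name}", token, nx=True, ex=ttl):
        return token
    return None

def release_ops_lock(name: str, token: str) -> None:
    """Release only if we still hold it (check-and-delete; illustrative, not race-free)."""
    if r.get(f"opslock:{name}") == token.encode():
        r.delete(f"opslock:{name}")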
Extracting approach
Hit issue
Develop manual approach
Build tools to improve manual / hands-on approach
Replace manual with automated system
Monitoring
Munin finally broke
Ganglia for graphing
Sensu for alerting
(http://sensuapp.org)
StatsD/Graphite still
chugging along
waittime: lightweight slow
component tracking
s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
asPercent()
Feeds and Inboxes
Redis
In memory requirement
Every churned or inactive user
Inbox moved to
Cassandra
1000:1 write/read
Prereq: having rbranson,
ex-DataStax
C* cluster is 20% of the
size of Redis one
Main feed (timeline) still in
Redis
Knobs
Dynamic ramp-ups and
config
Previously: required
deploy
knobs.py
Only ints
Stored in Redis
Refreshed every 30s
knobs.get(feature_name,
default)
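A sketch of what a knobs.get along these lines could look like (illustrative, not the real knobs.py):

import time
import redis

_redis = redis.Redis()
_cache: dict[str, tuple[int, float]] = {}  # name -> (value, fetched_at)
REFRESH_SECONDS = 30

def get(feature_name: str, default: int) -> int:
    """Integer-only knob: read from Redis, cached in-process, refreshed every 30s."""
    value, fetched_at = _cache.get(feature_name, (default, 0.0))
    if time.time() - fetched_at > REFRESH_SECONDS:
        raw = _redis.get(f"knobs:{feature_name}")
        value = int(raw) if raw is not None else default
        _cache[feature_name] = (value, time.time())
    return value

# e.g. page_size = get("feed_page_size", 20)  # change it in Redis, no deploy needed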
Uses
Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches
As more teams around
FB contribute
Decouple deploy from
feature rollout
Video
Launch a top 10 video site on day 1
with a team of 6 engineers, in less than 2 months
Reuse what we know
Avoid magic middleware
VXCode
Separate from main App
servers
Django-based
server-side transcoding
ZooKeeper ephemeral
nodes for detection
(finally worth it / doable to
deploy ZK)
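A sketch of the ephemeral-node registration pattern using kazoo (paths and addresses are illustrative):

from kazoo.client import KazooClient

def register_worker(zk_hosts: str, advertised_addr: str) -> KazooClient:
    """Advertise this transcoding worker; the znode disappears if the process dies."""
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.create("/transcoders/worker-", advertised_addr.encode(),
              ephemeral=True, sequence=True, makepath=True)
    return zk

def live_workers(zk: KazooClient) -> list[str]:
    """Whoever is currently registered is alive; no separate health database."""
    return [zk.get(f"/transcoders/{child}")[0].decode()
            for child in zk.get_children("/transcoders")]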
EC2 autoscaling
Priority list for clients
Transcoding tier is
completely stateless
statsd waterfall
holding area for
debugging bad videos
5 million videos in first day
40h of video / hour
(other than perf improvements we’ve
basically not touched it since launch)
FB
Where can we skip a few
years?
(at our own pace)
Spam fighting
re.compile('f[o0][1l][o0]w')
Simplest thing did not last
Generic features +
machine learning
Hadoop + Hive + Presto
"I wonder how they..."
Two-way exchange
2010 vintage infra
#1 impact: recruiting
Backend team: >10
people now
Wrap up
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
Takeaways
Recruit way earlier than
you'd think
Simple doesn't always
imply hacky
Rocketship scaling has been
(somewhat) democratized
Huge thanks to IG
Eng Team
mikeyk@instagram.com
Mike Krieger discusses Instagram's best and worst infrastructure decisions, building and deploying scalable and extensible services. Audio here: https://soundcloud.com/jldavid/mike-krieger-how-a-small-team-scales-instagram
