A Brief, Rapid History of Scaling Instagram
(with a tiny team)
Mike Krieger
QConSF 2013
Hello!
Instagram
30 million with 2 eng
(2010-end 2012)
150 million with 6 eng
(2012-now)
How we scaled
What I would have
done differently
What tradeoffs you make when
scaling with that size team
(if you can help it, have a
bigger team)
perfect solutions
survivor bias
decision-making process
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
This talk
Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB
Getting Started
2010
2 guys on a pier
no one <3s it
Focus
Mike iOS, Kevin Server
Early Stack
Django + Apache mod_wsgi
Postgres
Redis
Gearman
Memcached
Nginx
If today
Django + uWSGI
Postgres
Redis
Celery
Memcached
HAProxy
Three months later
Server planning night
before launch
Traction!
Year 1: Scaling Up
scaling.enable()
Single server in LA
infra newcomers
“What’s a load average?”
“Can we get another
server?”
Doritos &
Red Bull &
Animal Crackers &
Amazon EC2
Underwater on recruiting
2 total engineers
Scale "just enough" to get
back to working on app
Every weekend was an
accomplishment
“Infra is what happens when you’re busy
making other plans”
—Ops Lennon
Scaling up DB
First bottleneck: disk IO
on old Amazon EBS
At the time: ~400 IOPS
max
Simple thing first
Vertical partitioning
Django DB Routers
Partitions
Media
Likes
Comments
Everything else
PG Replication to
bootstrap nodes
Bought us some time
Almost no application logic changes
(other than some primary keys)
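A minimal sketch of what this vertical partitioning looks like with a Django DB router, assuming hypothetical app labels and database aliases (not Instagram's actual code):

# settings.py maps each alias to its own Postgres box, e.g.
# DATABASES = {"default": ..., "media": ..., "likes": ..., "comments": ...}
# DATABASE_ROUTERS = ["routers.VerticalPartitionRouter"]

class VerticalPartitionRouter:
    """Send each app's tables to its own database."""

    ROUTES = {"media": "media", "likes": "likes", "comments": "comments"}

    def db_for_read(self, model, **hints):
        return self.ROUTES.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return self.ROUTES.get(model._meta.app_label, "default")

    def allow_relation(self, obj1, obj2, **hints):
        # Keep relations inside a single database alias.
        return self.db_for_read(type(obj1)) == self.db_for_read(type(obj2))

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == self.ROUTES.get(app_label, "default")

Because the routing key is just the app label, the split stays invisible to most application code, which matches the "almost no application logic changes" note above.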
Today: SSD and provisioned
IOPS get you way further
Scaling up Redis
Purely RAM-bound
fork() and COW
Vertical partitioning by
data type
No easy migration story;
mostly double-writing
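A sketch of what double-writing during such a move can look like (client names, hosts, and keys are illustrative, not Instagram's):

import redis

old_redis = redis.Redis(host="redis-old")        # still serving reads
new_redis = redis.Redis(host="redis-followers")  # new instance for one data type

def add_follower(user_id: int, follower_id: int) -> None:
    """Write both copies during the migration; cut reads over once backfilled."""
    key = f"followers:{user_id}"
    old_redis.sadd(key, follower_id)
    new_redis.sadd(key, follower_id)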
Replicating + deleting
often leaves fragmentation
Chaining replication =
awesome
Scaling Memcached
Consistent hashing /
ketama
Mind that hash function
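A toy version of the idea, assuming an md5-based ring in the spirit of ketama (real clients such as libmemcached do this for you; the "mind that hash function" point is that a poorly distributed hash clumps keys onto a few servers):

import bisect
import hashlib

class HashRing:
    """Toy consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        points = []
        for node in nodes:
            for i in range(vnodes):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key: str) -> int:
        # md5 spreads keys evenly; a weak hash here skews load badly
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]

# ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
# ring.get_node("user:12345")  # adding a node only remaps ~1/N of the keys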
Why not Redis for kv
caching?
Slab allocator
Config Management
& Deployment
fabric + parallel git pull
(sorry GitHub)
All AMI based snapshots
for new instances
update_ami.sh
update_update_ami.sh
Should have done Chef
earlier
Munin monitoring
df, CPU, iowait
Ending the year
Infra going from 10% time
to 70%
Focus on client
Testing & monitoring kept
concurrent fires to a minimum
Several ticking time
bombs
Year 2: Scaling Out
App tier
Stateless, but plentiful
HAProxy
(Dead node detection)
Connection limits
everywhere
PGBouncer
Homegrown Redis pool
Hard to track down kernel
panics
Skip rabbit hole; use instance-status to detect and restart
Database Scale Out
Out of IO again
(Pre SSDs)
Biggest mis-step
NoSQL?
Call our friends
and strangers
Theory: partitioning and rebalancing
are hard to get right,
let DB take care of it
MongoDB (1.2 at the
time)
Double write, shadow
reads
Stressing about Primary Key
Placed in prod
Data loss, segfaults
Could have made it
work…
…but it would have been
someone’s full time job
(and we still only had 3
people)
train + rapidly
approaching cliff
Sharding in Postgres
QCon to the rescue
Similar approach to FB
(infra foreshadowing?)
Logical partitioning, done
at application level
Simplest thing; skipped
abstractions & proxies
Pre-split
5000 partitions
note to self: pick a power
of 2 next time
Postgres "schemas"
database → schema → table → columns
machineA:
  shard0 → photos_by_user
  shard1 → photos_by_user
  shard2 → photos_by_user
  shard3 → photos_by_user
machineA’ (replica of machineA):
  shard0 → photos_by_user
  shard1 → photos_by_user
  shard2 → photos_by_user
  shard3 → photos_by_user
Bring up machineA’ via replication, then split the logical shards between the two machines
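A sketch of how the application-level mapping can work, assuming a simple user_id modulo scheme and illustrative names:

NUM_LOGICAL_SHARDS = 5000  # the talk's count; a power of 2 would divide more cleanly

# logical shard -> physical machine; scaling out = repointing entries at a replica
SHARD_TO_HOST = {shard: "machineA" for shard in range(NUM_LOGICAL_SHARDS)}

def shard_for_user(user_id: int) -> int:
    return user_id % NUM_LOGICAL_SHARDS

def photos_query(user_id: int) -> tuple[str, str]:
    """Return (host, SQL) for a user's photos; each logical shard is a Postgres schema."""
    shard = shard_for_user(user_id)
    sql = f"SELECT * FROM shard{shard}.photos_by_user WHERE user_id = %s"
    return SHARD_TO_HOST[shard], sql

Since each logical shard is just a schema, moving load comes down to repointing shard map entries at the promoted replica rather than rewriting data.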
Still how we scale PG
today
9.2 upgrade: bucardo to
move schema by schema
ID generation
Requirements
No extra moving parts
64 bits max
Time ordered
Containing partition key
41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID
10 bits: auto-incrementing sequence, modulo 1024
This means we can generate 1024 IDs, per shard, per table, per millisecond
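A Python sketch of packing an ID with that layout (Instagram's published write-up does this inside a PL/pgSQL function on each shard; the epoch below is arbitrary):

import time

EPOCH_MS = 1293840000000  # arbitrary custom epoch (2011-01-01 UTC); illustrative only

def make_id(shard_id: int, seq: int, now_ms: int | None = None) -> int:
    """64-bit ID: 41 bits of millis | 13 bits of shard | 10 bits of sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = (now_ms - EPOCH_MS) & ((1 << 41) - 1)
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

def shard_of(item_id: int) -> int:
    """The partition key travels inside the ID, so any ID can be routed."""
    return (item_id >> 10) & 0x1FFF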
Lesson learned
A new db is a full time
commitment
Be thrifty with your
existing tech
= minimize moving parts
Scaling configs/host
discovery
ZooKeeper or DNS
server?
No team to maintain
/etc/hosts
ec2tag KnownAs
fab update_etc_hosts
(generates, deploys)
Limited: dead host
failover, etc
But zero additional infra, got
the job done, easy to debug
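A sketch of the generate half, assuming boto3 and the KnownAs tag (the original was a fabric task against the EC2 API of the day, not this code):

import boto3

def known_hosts() -> dict:
    """Map each running instance's KnownAs tag to its private IP (pagination omitted)."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["KnownAs"]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    hosts = {}
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            hosts[tags["KnownAs"]] = inst["PrivateIpAddress"]
    return hosts

def render_etc_hosts(hosts: dict) -> str:
    lines = ["127.0.0.1 localhost"]
    lines += [f"{ip} {name}" for name, ip in sorted(hosts.items())]
    return "\n".join(lines) + "\n"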
Monitoring
Munin: too coarse, too
hard to add new stats
StatsD & Graphite
Simple tech
statsd.timer
statsd.incr
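For context, the one-line interface that made stats this cheap looks roughly like the following (using the common Python statsd client; names are illustrative):

import statsd

stats = statsd.StatsClient("statsd.internal", 8125, prefix="app")

def _write_like(user_id, photo_id):
    pass  # stand-in for the real write path

def like_photo(user_id, photo_id):
    stats.incr("likes.created")            # one-line counter
    with stats.timer("likes.create_ms"):   # one-line timer
        _write_like(user_id, photo_id)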
Step change in developer
attitude towards stats
<5 min from wanting to
measure, to having a graph
580 statsd counters
164 statsd timers
Ending the year
Launched Android
(doubling all of our infra, most of
which was now horizontally scalable)
Doubled active users in
< 6 months
Finally, slowly, building up
team
Year 3+: Stability,
Video, FB
Scale tools to match
team
Deployment &
Config Management
Finally 100% on Chef
Simple thing first: knife
and chef-solo
Every new hire learns
Chef
Code deploys
Many rollouts a day
Continuous integration
But push still needs a
driver
"Ops Lock"
Humans are terrible
distributed locking systems
Sauron
Redis-enforced locks
Rollout / major config changes
/ live deployment tracking
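A minimal sketch of a Redis-enforced ops lock in this spirit (not Sauron itself; key names and TTL are made up):

import uuid
import redis

r = redis.Redis()

def acquire_ops_lock(name: str = "deploy", ttl: int = 600):
    """Try to take the ops lock; returns a token if we got it, else None."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl -- only succeeds if no one else holds the lock
    if r.set(f"opslock:{name}", token, nx=True, ex=ttl):
        return token
    return None

def release_ops_lock(name: str, token: str) -> None:
    """Release only if we still hold it (check-and-delete; illustrative, not race-free)."""
    if r.get(f"opslock:{name}") == token.encode():
        r.delete(f"opslock:{name}")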
Extracting approach
Hit issue
Develop manual approach
Build tools to improve manual / hands-on approach
Replace manual with automated system
Monitoring
Munin finally broke
Ganglia for graphing
Sensu for alerting
(http://sensuapp.org)
StatsD/Graphite still
chugging along
waittime: lightweight slow
component tracking
s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
asPercent()
Feeds and Inboxes
Redis
In memory requirement
Every churned or inactive user
Inbox moved to
Cassandra
1000:1 write/read
Prereq: having rbranson,
ex-DataStax
C* cluster is 20% of the
size of Redis one
Main feed (timeline) still in
Redis
Knobs
Dynamic ramp-ups and
config
Previously: required
deploy
knobs.py
Only ints
Stored in Redis
Refreshed every 30s
knobs.get(feature_name,
default)
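A sketch of what a knobs.get along these lines could look like (illustrative, not the real knobs.py):

import time
import redis

_redis = redis.Redis()
_cache: dict[str, tuple[int, float]] = {}  # name -> (value, fetched_at)
REFRESH_SECONDS = 30

def get(feature_name: str, default: int) -> int:
    """Integer-only knob: read from Redis, cached in-process, refreshed every 30s."""
    value, fetched_at = _cache.get(feature_name, (default, 0.0))
    if time.time() - fetched_at > REFRESH_SECONDS:
        raw = _redis.get(f"knobs:{feature_name}")
        value = int(raw) if raw is not None else default
        _cache[feature_name] = (value, time.time())
    return value

# e.g. page_size = get("feed_page_size", 20)  # change it in Redis, no deploy needed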
Uses
Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches
As more teams around
FB contribute
Decouple deploy from
feature rollout
Video
Launch a top 10 video site on day 1
with a team of 6 engineers, in less than 2 months
Reuse what we know
Avoid magic middleware
VXCode
Separate from main App
servers
Django-based
server-side transcoding
ZooKeeper ephemeral
nodes for detection
(finally worth it / doable to
deploy ZK)
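A sketch of the ephemeral-node registration pattern using kazoo (paths and addresses are illustrative):

from kazoo.client import KazooClient

def register_worker(zk_hosts: str, advertised_addr: str) -> KazooClient:
    """Advertise this transcoding worker; the znode disappears if the process dies."""
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.create("/transcoders/worker-", advertised_addr.encode(),
              ephemeral=True, sequence=True, makepath=True)
    return zk

def live_workers(zk: KazooClient) -> list[str]:
    """Whoever is currently registered is alive; no separate health database."""
    return [zk.get(f"/transcoders/{child}")[0].decode()
            for child in zk.get_children("/transcoders")]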
EC2 autoscaling
Priority list for clients
Transcoding tier is
completely stateless
statsd waterfall
holding area for
debugging bad videos
5 million videos in first day
40h of video / hour
(other than perf improvements we’ve
basically not touched it since launch)
FB
Where can we skip a few
years?
(at our own pace)
Spam fighting
re.compile('f[o0][1l][o0]w')
Simplest thing did not last
Generic features +
machine learning
Hadoop + Hive + Presto
"I wonder how they..."
Two-way exchange
2010 vintage infra
#1 impact: recruiting
Backend team: >10
people now
Wrap up
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
Takeaways
Recruit way earlier than
you'd think
Simple doesn't always
imply hacky
Rocketship scaling has been
(somewhat) democratized
Huge thanks to IG
Eng Team
mikeyk@instagram.com
Mike Krieger discusses Instagram's best and worst infrastructure decisions, building and deploying scalable and extensible services. Audio here: https://soundcloud.com/jldavid/mike-krieger-how-a-small-team-scales-instagram
