3. 2013
What is Foursquare?
Foursquare helps you
explore the world
around you.
Meet up with friends,
discover new places,
and save money
using your phone.
5. 2013
Moar stats
●
5MM-6MM checkins a day
●
~4K-5K qps against our API, ~150K-300K qps against
Mongo
●
11 clusters, 8 sharded + 3 replica sets
– Mongo 2.2.{3,4}
●
~4TB of data and indexes, all of it kept in memory
– 24-core Intel, 192GB RAM, 4 120GB SSD drives
●
Extensive use of sharding + replica set
ReadPreference.secondary reads
2013
6. 2013
Mongo has scaled with us
We have been using Mongo for three years.
Mongo has enabled us to scale our service
along with our growing user base. It has
given us the flexibility and agility to innovate
in an exciting space where there is still much
to figure out.
2013
7. 2013
Still, some things to deal with
●
Monitoring
– MMS is good, but always could use more stats to
narrow down on pain points.
●
General maintenance
– Constant struggle with data fragmentation
●
Sharding
– No load-based balancing (SERVER-2472)
– Overhead of all-shards queries
– Bugs can leave duplicate records on the wrong
shards
2013
11. 2013
Data size and fragmentation
• Problem: Even with bare metal and SSDs, fragmentation
can degrade performance by increasing data size beyond
available memory.
– Can also be an issue with autobalancing as chunk
moves induce increased paging and further degrade
I/O
• We have enough replicas (~400) where we need to do
this regularly
2013
13. 2013
Hack: “Mackinac”
• (Mostly) automated repair script
• “Kill file” mongod
– Drain queries gracefully from mongod
– About kill files:
http://engineering.foursquare.com/2012/06/20/stability-in-the-midst-of-chaos/
• Stops mongod
• Resyncs from primary
– We considered running compact() but it doesn't
reclaim disk space. May revisit this though.
2013
15. 2013
Hack: Shard key checker
• Checks for shard key usage in the app
• Loads shard keys from mongo config servers,
matches keys against a given {query, document}
• e.g.
db.users({ _id : 12345 }) // shard key match!
db.users({ twitter: “joe@some.com” }) // shard key
miss!
• Why use this?
2013
16. 2013
Detect all-shards queries!
• Problem: All-shards queries
– Not all our queries use shard keys (unfortunately)
– Use up connections, network traffic, overhead in
query processing on mongod + mongos
– Gets worse with more shards
– What happens if one of the replicas is not
responding?
●
Solution: Measure by intercepting queries with
shard key checker and count misses
2013
19. 2013
Find hot chunks!
• Problem: Autobalancer balances by data size, but not
load.
– Checkins shard key → {u : 1}
– Imagine a group of users who check in a bunch.
– Imagine the balancer putting all those users on the
same shard, or even the same chunk.
• Solution: Intercept queries with shard key checker and
bucket hits by chunk
2013
20. 2013
Hack: Hot chunk detector
• Need to do a little more to make this work with the
shard key checker
– Create trees of chunk ranges per-collection
– Match shard keys from queries to chunk ranges,
accumulate counts
2013
22. 2013
Deeper hack: Hot chunk mover
• A standalone process reading from JSON endpoint on
hot chunk detector
– Identifies the hottest chunk
– Attempts to split it (if necessary)
– Move the chunk to the “coldest” shard
• Subject to same problems as regular chunk moves, can
disrupt latencies
• Currently using a hit ratio of p9999/p50 to identify hot
chunk
• Work in progress
2013
24. 2013
Fix data integrity!
• Problem: Application doesn't always clean up after itself
properly.
– Duplicate documents can exist on multiple shards
• Solution: Compare the document shard key from the
host replica against the canonical metadata in shard
key checker (i.e. where the document should “live”)
2013
25. 2013
Hack: “Chunksanity”
• Simple algorithm:
– Connects to each shard
– Iterates through each document in each
collection
– Verifies that the document is correctly placed
according to chunk data in mongo config server
– Deletes any incorrectly placed documents
• Heavyweight process, run only periodically on
specific suspicious collections
2013
26. 2013
Sample Chunksanity logging
[2013-05-29 18:23:36,128] [main] INFO c.f.m.chunksanity.MongoChunkSanity -
Logging misplaced docs to localhost/production_misplaced_docs
[2013-05-29 18:23:36,301] [main] INFO c.f.m.chunksanity.MongoChunkSanity -
Verifying chunks on users at office-canary-2/10.101.1.175:26190
[2013-05-29 18:23:37,971] [ForkJoinPool-1-worker-1] INFO
c.f.m.chunksanity.MongoChunkSanity - Looking at collection foursquare.users on
shard shard0007 using filter: { } and shardKey { "_id" : 1.0}
[2013-05-29 18:24:06,892] [ForkJoinPool-1-worker-2] INFO
c.f.m.chunksanity.MongoChunkSanity - Done with foursquare.users on shard users20
0/xxxx misplaced in 28911 ms
2013