1. Schema Design at Scale
Eliot Horowitz
@eliothorowitz
MongoSF
May 24, 2011
2. Schema
• Single biggest performance factor
• More choices than in an RDBMS
• Embedding, index design, shard keys
3. Embedding
• Great for read performance
• One seek to load entire object
• One roundtrip to database
• Writes can be slow if adding to objects all the time
• Should you embed comments?
4. Blog Post - Embedded
{ _id : "/post/eliot/2011-05-24/1",
author : "eliot",
text : "About MongoDB...",
tags : [ "tech", "databases" ],
comments : [
{
author : "Fred",
date : "Sat Apr 25 2010 20:51:03 GMT-0700",
text : "Best Post Ever!"
}
]}
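A minimal sketch of what "adding to objects" means for the embedded layout. The update document uses MongoDB's `$push` operator; `applyPush` is a hypothetical in-memory stand-in (not a driver API) so the sketch runs without a server, and the second comment is invented for illustration.

```javascript
// Embedded post, as on the slide.
const post = {
  _id: "/post/eliot/2011-05-24/1",
  author: "eliot",
  text: "About MongoDB...",
  tags: ["tech", "databases"],
  comments: [
    { author: "Fred", date: "Sat Apr 25 2010 20:51:03 GMT-0700", text: "Best Post Ever!" },
  ],
};

// Update document that would be sent to the server to append one comment.
const addComment = {
  $push: { comments: { author: "Bob", date: "May 24 2011", text: "Awesome" } },
};

// Hypothetical in-memory stand-in for the server-side $push (illustration only).
function applyPush(doc, update) {
  const result = { ...doc };
  for (const [field, value] of Object.entries(update.$push)) {
    result[field] = [...(doc[field] || []), value];
  }
  return result;
}

const updated = applyPush(post, addComment);
console.log(updated.comments.length);
// The whole document grows with every comment, which is the write cost the slide warns about.
console.log(JSON.stringify(updated).length > JSON.stringify(post).length);
```

Against a real server, the same update document would be passed to something like `db.posts.updateOne({ _id: post._id }, addComment)`; the collection name `posts` is an assumption here.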
5. Blog Post - Not Embedded
blog.posts
{ _id : "/post/eliot/2011-05-24/1",
author : "eliot",
text : "About MongoDB...",
tags : [ "tech", "databases" ]
}
blog.comments
{
post : "/post/eliot/2011-05-24/1",
author : "Fred",
date : "May 24 2011",
text : "Best Post Ever!"
}
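A rough sketch of the read path for the non-embedded layout: rendering a post takes two lookups instead of one. Plain arrays stand in for `blog.posts` and `blog.comments`; `loadPost` and the second post id are hypothetical names used only for illustration.

```javascript
// In-memory stand-ins for the two collections on the slide.
const posts = [
  { _id: "/post/eliot/2011-05-24/1", author: "eliot", text: "About MongoDB...", tags: ["tech", "databases"] },
];
const comments = [
  { post: "/post/eliot/2011-05-24/1", author: "Fred", date: "May 24 2011", text: "Best Post Ever!" },
  // Comment on a different (hypothetical) post, to show the filter doing work.
  { post: "/post/eliot/2011-05-24/2", author: "Bob", date: "May 24 2011", text: "Awesome" },
];

// Loading a post with its comments needs one lookup per collection.
function loadPost(postId) {
  const post = posts.find((p) => p._id === postId);               // lookup 1: blog.posts
  const postComments = comments.filter((c) => c.post === postId); // lookup 2: blog.comments
  return { post, comments: postComments, lookups: 2 };
}

const page = loadPost("/post/eliot/2011-05-24/1");
console.log(page.post.author, page.comments.length, page.lookups);
```

In a real deployment the second lookup would presumably rely on an index on `post` in `blog.comments`; the slide does not say so, so treat that as an assumption.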
6. Blog Post - Hybrid
blog.comments
{
_id : "/post/eliot/2011-05-24/1---1",
comments : [
{ author : "Fred",
date : "May 24 2011",
text : "Best Post Ever!" } ,
{ author : "Bob",
date : "May 24 2011",
text : "Awesome" }
]
}
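One way to read the hybrid `_id` is "post id, then `---`, then a bucket number", with each bucket document holding a bounded batch of comments. The sketch below assumes that reading; the bucket size of 100, the 1-based numbering, and the helper name `bucketIdFor` are assumptions, not something the slide states.

```javascript
// Assumed maximum number of comments stored in one bucket document.
const BUCKET_SIZE = 100;

// Map the zero-based position of a comment to the _id of the bucket that holds it.
function bucketIdFor(postId, commentIndex, bucketSize = BUCKET_SIZE) {
  const bucketNumber = Math.floor(commentIndex / bucketSize) + 1; // buckets numbered from 1
  return `${postId}---${bucketNumber}`;
}

const postId = "/post/eliot/2011-05-24/1";
console.log(bucketIdFor(postId, 0));   // first comment lands in bucket 1
console.log(bucketIdFor(postId, 100)); // comment 101 starts bucket 2
```

Each bucket stays bounded in size, so appends touch a small document while a page of comments can still be read with a single lookup.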
7. Indexes
• Index common queries
• Make sure there aren't duplicates: an index on (A) isn't needed if (A,B) exists
• Right-balanced indexes keep working set small
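The duplicate-index rule can be sketched as a prefix check: an index is a candidate for removal when its keys are a leading prefix of another index. `isRedundantPrefix` is a hypothetical helper; it is deliberately conservative (directions must match exactly) and ignores index options such as uniqueness, which can make a shorter index worth keeping.

```javascript
// Returns true when `shorter` is a strict leading prefix of `longer`,
// comparing field names and directions position by position.
function isRedundantPrefix(shorter, longer) {
  const a = Object.entries(shorter);
  const b = Object.entries(longer);
  if (a.length >= b.length) return false; // identical or longer specs are not prefixes
  return a.every(([field, dir], i) => b[i][0] === field && b[i][1] === dir);
}

console.log(isRedundantPrefix({ a: 1 }, { a: 1, b: 1 })); // (A) is covered by (A,B)
console.log(isRedundantPrefix({ b: 1 }, { a: 1, b: 1 })); // (B) is not a leading prefix
```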
10. Covered Indexes
• Keep data sequential in index
• find( { email : "eliot@10gen.com" } , { first : 1 , last : 1 , state : 1 , _id : 0 } )
• index: { email : 1 , first : 1 , last : 1 , state : 1 }
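A simplified sketch of the covered-index condition: every field the query filters on or returns must live in the index. Note that `_id` is returned by default, so the projection has to exclude it (or the index has to contain it) for the query to be answered from the index alone. `isCovered` and `returnedFields` are hypothetical helpers; they only handle top-level fields and inclusion-style projections, and ignore cases such as array (multikey) fields.

```javascript
// Fields the server would return for an inclusion-style projection.
function returnedFields(projection) {
  const fields = Object.keys(projection).filter((f) => projection[f] === 1);
  // _id comes back unless it is explicitly excluded with _id: 0.
  if (projection._id !== 0 && !fields.includes("_id")) fields.push("_id");
  return fields;
}

// True when both the filter fields and the returned fields are all in the index.
function isCovered(indexKeys, query, projection) {
  const indexed = new Set(Object.keys(indexKeys));
  const needed = [...Object.keys(query), ...returnedFields(projection)];
  return needed.every((f) => indexed.has(f));
}

const index = { email: 1, first: 1, last: 1, state: 1 };
const query = { email: "eliot@10gen.com" };

console.log(isCovered(index, query, { first: 1, last: 1, state: 1, _id: 0 })); // index alone suffices
console.log(isCovered(index, query, { first: 1, last: 1, state: 1 }));         // _id forces a document fetch
```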
11. Choosing a Shard Key
• Shard key determines how data is partitioned
• Hard to change
• Most important performance decision
12. Range Based
• Collection is broken into chunks by range
• Chunks default to 200 MB or 100,000 objects
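Range-based partitioning can be pictured as a table of half-open key ranges, each owned by one shard. The chunk boundaries, shard names, and the `chunkFor` helper below are hypothetical; `null` marks an unbounded end of the key space.

```javascript
// Hypothetical chunk table: each chunk owns shard-key values in [min, max).
const chunks = [
  { min: null, max: "g", shard: "shard0" },
  { min: "g", max: "p", shard: "shard1" },
  { min: "p", max: null, shard: "shard2" },
];

// Find the single chunk whose range contains the given shard-key value.
function chunkFor(key) {
  return chunks.find(
    (c) => (c.min === null || key >= c.min) && (c.max === null || key < c.max)
  );
}

console.log(chunkFor("eliot@10gen.com").shard);
console.log(chunkFor("mike@example.com").shard);
```

When a chunk grows past the size limit it is split into two adjacent ranges, which is why the choice of key determines where new data piles up.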
13. Use Case: User Profiles
{ email : "eliot@10gen.com" ,
addresses : [ { state : "NY" } ]
}
• Shard by email
• Lookup by email hits 1 node
• Index on { "addresses.state" : 1 }
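A toy illustration of why the email lookup hits one node while a query by state does not: a query that includes the shard key can be routed to a single shard, and any other query is sent to every shard, where the local `addresses.state` index keeps it cheap. The shard names, `pickShard`, and `targetShards` are hypothetical stand-ins for the router.

```javascript
// Hypothetical cluster of three shards.
const shards = ["shard0", "shard1", "shard2"];

// Toy range routing on the shard key (email), for illustration only.
const pickShard = (email) => (email < "g" ? "shard0" : email < "p" ? "shard1" : "shard2");

// Queries containing the shard key are targeted; all others are broadcast.
function targetShards(query, shardKey) {
  if (Object.prototype.hasOwnProperty.call(query, shardKey)) {
    return [pickShard(query[shardKey])];
  }
  return shards.slice();
}

console.log(targetShards({ email: "eliot@10gen.com" }, "email")); // one shard
console.log(targetShards({ "addresses.state": "NY" }, "email"));  // every shard
```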
14. Use Case: Activity Stream
{ user_id : XXX, event_id : YYY , data : ZZZ }
• Shard by user_id
• Looking up an activity stream hits 1 node
• Writes are evenly distributed
• Index on { "event_id" : 1 } for deletes
15. Use Case: Photos
{ photo_id : ???? , data : <binary> }
What’s the right key?
• auto increment
• MD5( data )
• now() + MD5(data)
• month() + MD5(data)
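To make the four candidates concrete, the sketch below builds each key for the same photo bytes using Node's built-in `crypto` module. The key formats (separators, millisecond timestamp, `YYYY-MM` month) and the helper names are assumptions; the comments give a common reading of the trade-offs, not a verdict from the slide.

```javascript
const crypto = require("crypto");

// Hex-encoded MD5 digest of the photo bytes.
function md5Hex(data) {
  return crypto.createHash("md5").update(data).digest("hex");
}

// Build all four candidate shard keys for one photo.
function candidateKeys(data, now = new Date(), counter = 1) {
  const hash = md5Hex(data);
  const month = now.toISOString().slice(0, 7); // "YYYY-MM" in UTC
  return {
    autoIncrement: counter,                 // always increasing: inserts pile onto the last chunk
    md5: hash,                              // spreads writes, but touches the whole index at random
    nowPlusMd5: `${now.getTime()}-${hash}`, // time prefix dominates: behaves like auto increment
    monthPlusMd5: `${month}-${hash}`,       // random within a month: recent index range stays hot
  };
}

const keys = candidateKeys(Buffer.from("hello"), new Date(Date.UTC(2011, 4, 24)));
console.log(keys);
```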