MongoDB for Genealogy

Storing the Family Tree with MongoDB

We’re going to talk about
MongoDB Intro & Fundamentals
MongoDB for Genealogy data
Scaling MongoDB for all the generations
The Family Tree
Storing a graph in MongoDB

Steve @sp

A
15+ years building
the internet
Father, husband,
skateboarder,
genealogist at ❤

Chief Solutions Architect @
responsible for drivers,
integrations, web & docs

Company behind MongoDB
Offices in NYC, Palo Alto, London & Dublin
100+ employees
Support, consulting, training
Mgt: Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic

Well Funded: Sequoia, Union Square, Flybridge

1974
The relational database is created

Computers in 1995
100 mhz Pentium
10 base T
16 MB ram
200 MB HD

Cloud in 1995
(Windows 95 cloud wallpaper)

Cell Phones in 2012
Dual core 1.5Ghz
802.11n (300+ Mbps)
1 GB ram
64 GB Solid State

MongoDB
Application
Document Oriented
{ author : "steve",
date : new Date(),
text : "About MongoDB...",
tags : ["tech", "database"]}
High Performance
Fully Consistent
Horizontally Scalable

MongoDB philosophy
Keep functionality when we can (key/value
stores are great, but we need more)
Non-relational (no joins) makes scaling
horizontally practical
Document data models are good
Database technology should run anywhere
virtualized, cloud, metal, etc

Under the hood
Written in C++
Runs nearly everywhere
Data serialized to BSON
Extensive use of memory-mapped files,
i.e. read-through write-through memory caching.

Database Landscape
Scalability & Performance

MemCache

MongoDB

RDBMS

Depth of Functionality

“MongoDB has the best features of key/value stores, document
databases and relational databases in one.”
John Nunemaker

Relational made normalized
data look like this
Category
• Name
• Url

Article
• Name
• Slug
• Publish date
• Text

User
• Name
• Email Address

Tag
• Name
• Url

Comment
• Comment
• Date
• Author

Document databases make
normalized data look like this
User
• Name
• Email Address

Article
• Name
• Slug
• Publish date
• Text
• Author
Comment[]
• Comment
• Date
• Author

Tag[]
• Value

Category[]
• Value

But we’ve been using
a relational database
for 40 years!

How do people store
documents in real life?

Think about a
doctor’s office
There are two ways they
could organize their files

1. Each document type
in its own drawer
MRIs, X-rays, Lab, Invoices, Index, History, Medications, Lab Forms

2. Group related records
Patient 1, Patient 2, Patient 3 ...
Vendor 1, Vendor 2, Vendor 3

Databases work the same way
Relation Docum

Patient 1 Vendor 1

[Figure: the relational layout (Category, Article, User, Tag, Comment) beside the document layout (Article with embedded Comment[], Tag[] and Category[], plus User)]

Terminology
RDBMS Mongo
Table, View ➜ Collection
Row ➜ Document
Index ➜ Index
Join ➜ Embedded
Foreign Key ➜ Document
Reference
Partition ➜ Shard

Why MongoDB
My Top 10 Reasons

10. Great developer experience
9. Speaks your language
8. Scale horizontally
7. Fully consistent data w/atomic operations
6. Memory caching integrated
5. Open source
4. Flexible, rich & structured data format not just K:V
3. Ludicrously fast (without going plaid)
2. Simplify infrastructure & application
1. It’s web scale

CMS / Blog
Needs:
• Business needed modern data store for rapid development and
scale

Solution:
• Use PHP & MongoDB

Results:
• Real time statistics
• All data, images, etc stored together
easy access, easy deployment, easy high availability
• No need for complex migrations
• Enabled very rapid development and growth

Photo Meta-Data
Problem:
• Business needed more flexibility than Oracle could deliver

Solution:
• Use MongoDB instead of Oracle

Results:
• Developed application in one sprint cycle
• 500% cost reduction compared to Oracle
• 900% performance improvement compared to Oracle

Customer Analytics
Problem:
• Deal with massive data volume across all customer sites

Solution:
• Use MongoDB to replace Google Analytics / Omniture options

Results:
• Less than one week to build prototype and prove business case
• Rapid deployment of new features

Archiving
Why MongoDB:
• Existing application built on MySQL
• Lots of friction with RDBMS based archive storage
• Needed more scalable archive storage backend
Solution:
• Keep MySQL for active data (100mil)
• MongoDB for archive (2+ billion)
Results:
• No more alter table statements taking over 2 months to run
• Sharding fixed vertical scale problem
• Very happily looking at other places to use MongoDB

Online Dictionary
Problem:
• MySQL could not scale to handle their 5B+ documents

Solution:
• Switched from MySQL to MongoDB

Results:
• Massive simplification of code base
• Eliminated need for external caching system
• 20x performance improvement over MySQL

E-commerce
Problem:
• Multi-vertical E-commerce impossible to model (efficiently) in
RDBMS

Solution:
• Switched from MySQL to MongoDB

Results:
• Massive simplification of code base
• Rapidly build, halving time to market (and cost)
• Eliminated need for external caching system
• 50x+ performance improvement over MySQL

Tons more
MongoDB casts a wide net

people keep coming up with
new and brilliant ways to use it

In Good Company

and 1000s more

Start with an
(or array, hash, dict, e

place1 = {
name : "10gen HQ",
address : "578 Broadway 7th Floor",
city : "New York",
zip : "10011",
tags : [ "business", "awesome" ]
}

Inserting the record
Initial Data Load

> db.places.insert(place1)


Querying
{
name : "10gen HQ",
address : "134 5th Avenue 3rd Floor",
city : "New York",
zip : "10011",
tags : [ "business", "awesome" ]
}

> db.places.findOne({ zip: "10011",
tags: "awesome" })

> db.places.find({tags: "business" })

Nested Documents
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
author : "roger",
date : "Sat Apr 24 2011 19:47:11",
text : "About MongoDB...",
tags : [ "tech", "databases" ],
comments : [
{
author : "Fred",
date : "Sat Apr 25 2010 20:51:03",
text : "Best Post Ever!"
}
]
}

Object ID

object(MongoId)#4 (1) {
["$id"]=> string(24) "4e9cc76a4a1817fd21000000"
}

4e9cc76a4a1817fd21000000
|------||----||--||----|
ts = 4e9cc76a (4 bytes), mac = 4a1817 (3 bytes), pid = fd21 (2 bytes), inc = 000000 (3 bytes)
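
The breakdown above can be checked by hand. The sketch below assumes the four-part layout shown on the slide (timestamp, machine, process id, counter); splitObjectId is a hypothetical helper written for this example, not a driver or shell API.

```javascript
// Split a 24-character ObjectId hex string using the layout drawn on the slide:
// 4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter.
// splitObjectId is a hypothetical helper, not part of any MongoDB driver.
function splitObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestampSeconds: parseInt(hex.slice(0, 8), 16), // seconds since the Unix epoch
    machine: hex.slice(8, 14),
    pid: hex.slice(14, 18),
    inc: hex.slice(18, 24)
  };
}

const parts = splitObjectId("4e9cc76a4a1817fd21000000");
const created = new Date(parts.timestampSeconds * 1000);
console.log(parts, created.toISOString());
```

Because the leading bytes are a timestamp, ids generated later sort after ids generated earlier, which is why sorting on _id roughly follows insertion order.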

A More Complex Document

place1 = {
name : "10gen HQ",
city : "New York",
zip : "10011",
tags : [ "business", "awesome" ],
latlong : [40.0,72.0],
tips : [ { user : "ryan",
time : "6/26/2011",
tip : "stop by for office hours"},

{.....}]
}

Indexing & Adv Querying
// Index nested documents
db.posts.ensureIndex({ "comments.author":1 })
db.posts.find({'comments.author':'Fred'})

// Regular Expressions
db.posts.find({'comments.author': /^Fr/})

// Index on tags (multi-key index)
db.posts.ensureIndex({ tags: 1})
db.posts.find( { tags: 'tech' } )

// geospatial index
db.posts.ensureIndex({ "author.location": "2d" })
db.posts.find({"author.location":{$near:[22,42]}})

Updating
> db.places.update(
    {name : "10gen HQ"},
    { $push :
        { tips :
            { user : "nosh",
              time : "7/14/2011",
              tip : "Office hours are great!"
            }
        }
    }
)

place1 = {
name : "10gen HQ",
city : "New York",
zip : "10011",
tags : [ "business", "awesome" ],
latlong : [40.0,72.0],
tips : [ { user : "ryan",
time : "6/26/2011",
tip : "stop by for office hours on
Wednesdays from 4-6pm"},
{ user : "nosh",
time : "7/14/2011",
tip : "Office hours are great!"}
]
}

Atomic
Operations
$set $unset $rename

$push $pop $pull

$addToSet $in
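
To make the difference between a few of these operators concrete, here is a minimal sketch that imitates $push, $addToSet and $pull on plain in-memory arrays. The function names are made up for this example; on the server the real operators run atomically inside a single update call, which this imitation does not attempt to show.

```javascript
// Plain-array imitations of three update operators, for illustration only.
// Each function returns a new array and leaves its input untouched.

// $push: always appends, even if the value is already present.
function push(arr, value) {
  return arr.concat([value]);
}

// $addToSet: appends only when the value is not already in the array.
function addToSet(arr, value) {
  return arr.includes(value) ? arr.slice() : arr.concat([value]);
}

// $pull: removes every element equal to the value.
function pull(arr, value) {
  return arr.filter(v => v !== value);
}

const tags = ["business", "awesome"];
console.log(push(tags, "awesome"));     // duplicate is kept
console.log(addToSet(tags, "awesome")); // unchanged
console.log(pull(tags, "business"));    // "business" removed
```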

Cursors
$cursor = $c->find(array("foo" => "bar"));

foreach ($cursor as $id => $value) {
echo "$id: ";
var_dump( $value );
}

$a = iterator_to_array($cursor);

Paging
// page_num is zero-based: page 3 skips the first 30 results
var page_num = 3;
var results_per_page = 10;

var cursor = db.collection.find()
    .sort({ "ts" : -1 })
    .skip(page_num * results_per_page)
    .limit(results_per_page);
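
The skip/limit arithmetic can be tried without a database. This sketch assumes zero-based page numbers and uses an in-memory array as a stand-in for the collection; pageWindow and pageOf are hypothetical helper names.

```javascript
// pageWindow and pageOf are hypothetical helpers; pageNum is zero-based.
function pageWindow(pageNum, resultsPerPage) {
  return { skip: pageNum * resultsPerPage, limit: resultsPerPage };
}

// In-memory stand-in for find().sort({ ts: -1 }).skip(n).limit(m).
function pageOf(docs, pageNum, resultsPerPage) {
  const w = pageWindow(pageNum, resultsPerPage);
  return docs
    .slice()                        // copy so the caller's array is not reordered
    .sort((a, b) => b.ts - a.ts)    // newest first
    .slice(w.skip, w.skip + w.limit);
}

const docs = [];
for (let i = 0; i < 45; i++) docs.push({ ts: i });
console.log(pageOf(docs, 3, 10).map(d => d.ts));
```

One design note: skip still walks past the skipped documents on the server, so for deep pages a range query on the sort key (for example ts less than the last value seen) is often the cheaper choice.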

Storing Big Files

Files over the 16 MB document limit are stored in chunks (GridFS)
Works with replicated and sharded systems
A better network FS
GridFS files are seamlessly sharded & replicated.
No OS constraints...
No file size limits
No naming constraints
No folder limits
Standard across different OSs
MongoDB automatically generates the MD5 hash of
the file

MongoDB for
Genealogy
Data

Types of
genealogy data
Events (birth, death, etc)
Official records
Census
Names
Relationships
Photographs
Diaries & letters
Ship passenger list
Occupation
and more

Challenges of
genealogy data
Lots of possible data points... need flexible schema
Multiple versions of same data point
(3 different dates for death date, 4 variations on
name).
Data related to records
Multiple versions of same nodes
(intelligent nondestructive merge needed)
Need to have meta data associated
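
As one way to picture a nondestructive merge, the sketch below combines the name variants from two records for the same person without discarding either. It assumes names are stored as first/middle/last fields that may each hold a string or an array; mergeNames and unionValues are hypothetical helpers, not part of the talk.

```javascript
// Combine two values (string, array, or undefined) into one de-duplicated array.
function unionValues(a, b) {
  const list = [].concat(a === undefined ? [] : a, b === undefined ? [] : b);
  return Array.from(new Set(list)); // Set keeps first-seen order
}

// Merge two name sub-documents, keeping every variant of every part.
// Assumed shape: { first, middle, last }, each a string or an array of strings.
function mergeNames(nameA, nameB) {
  const merged = {};
  for (const part of ["first", "middle", "last"]) {
    const values = unionValues(nameA[part], nameB[part]);
    if (values.length > 0) merged[part] = values;
  }
  return merged;
}

console.log(mergeNames(
  { first: ["john"], middle: "peter", last: ["smith"] },
  { first: ["johannes", "john"], last: ["sandvik"] }
));
```

Nothing is overwritten: a later reviewer can still see that both "smith" and "sandvik" were recorded and decide which to prefer.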

Recognize
0 @I2@ INDI
1 NAME Charles Phillip /Ingalls/
1 SEX M
1 BIRT
2 DATE 10 JAN 1836
2 PLAC Cuba, Allegheny, NY
1 DEAT
2 DATE 08 JUN 1902
2 PLAC De Smet, Kingsbury, Dakota Territory
1 FAMC @F2@
1 FAMS @F3@
0 @I3@ INDI
1 NAME Caroline Lake /Quiner/
1 SEX F
1 BIRT
2 DATE 12 DEC 1839
GEDCOM
File format, not a database
Handles the great variety of data well
Doesn’t really scale beyond a local user.
Doesn’t provide a good mechanism for storing
external documents (birth certificates, etc).
Built to solve problem of sharing data
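
Since GEDCOM is a line-based, level-numbered format, it maps naturally onto nested documents. The following is a rough sketch of that mapping, handling only the "LEVEL [@ID@] TAG [VALUE]" shape seen in the sample record; parseGedcom, toIndividual and the output field names (gedcomId, events, place) are assumptions made for this example, not a complete GEDCOM reader.

```javascript
// Parse GEDCOM text into a list of level-0 records, each a tree of
// { tag, id?, value?, children } nodes. Hypothetical helper for illustration.
function parseGedcom(text) {
  const records = [];
  const stack = []; // stack[level] is the most recent node seen at that level
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const m = /^(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s+(.*))?$/.exec(line);
    if (!m) continue; // skip lines this sketch does not understand
    const level = Number(m[1]);
    const node = { tag: m[3], children: [] };
    if (m[2]) node.id = m[2];
    if (m[4] !== undefined) node.value = m[4];
    if (level === 0) {
      records.push(node);
    } else if (stack[level - 1]) {
      stack[level - 1].children.push(node);
    }
    stack[level] = node;
    stack.length = level + 1; // forget deeper nodes from earlier lines
  }
  return records;
}

// Turn one INDI record into a document. Field names are assumptions.
function toIndividual(record) {
  const eventNames = { BIRT: "birth", DEAT: "death" };
  const doc = { gedcomId: record.id, events: {} };
  for (const child of record.children) {
    if (child.tag === "NAME") doc.name = child.value;
    else if (child.tag === "SEX") doc.sex = child.value;
    else if (eventNames[child.tag]) {
      const event = {};
      for (const detail of child.children) {
        if (detail.tag === "DATE") event.date = detail.value;
        if (detail.tag === "PLAC") event.place = detail.value;
      }
      doc.events[eventNames[child.tag]] = event;
    }
  }
  return doc;
}

const sample = [
  "0 @I2@ INDI",
  "1 NAME Charles Phillip /Ingalls/",
  "1 SEX M",
  "1 BIRT",
  "2 DATE 10 JAN 1836",
  "2 PLAC Cuba, Allegheny, NY",
  "1 DEAT",
  "2 DATE 08 JUN 1902"
].join("\n");

console.log(JSON.stringify(toIndividual(parseGedcom(sample)[0]), null, 2));
```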

Genealogy &
MongoDB

Genealogy is anything but rigid and fixed
Flexible schema fits genealogy data well
Packaging things together makes sense
Relating records doesn’t require a relational
database

Individual
• AFN
• Modification Date

Events[]
• type
• date
• contributor[]
• record[]

Name
• First[]
• Middle[]
• Last[]

Location
• city
• state
• county
• country
• coordinates[]

User
• Name
• Email Address
• Password
• Individual_id

Record
• contributor
• type
• thumbnail
• content
• description
• tags[]

Individual
individual = {
_id : ObjectId("4f2978dfaa999d9db02618ce"),
AFN : '1XYK-KQJ',
name: {
first: ['john', 'johannes'],
middle: 'peter',
last: ['smith', 'sandvik']
}
}

Individual
individual = {
AFN : '1XYK-KQJ',
name: {
middle: 'peter',
}
}

db.individual.find(
{"name.first" : 'john', "name.middle" : 'peter'})

Events
events : [ {
death : {
date : ISODate('1989-07-14'),
location : {
city: 'pensacola',
state: 'fl',
county: 'escambia',
country: 'usa',
coordinates : [30.26,87.12]},
contributor : ObjectId("4eeac...691")}}]

db.individual.find({"events.death.date" : ISODate('1989-07-14')})

db.individual.find({"events.death.location.coordinates" : { $near : [30,90] }})

Duplicate Events
events : [ {
birth : [ {
date : ISODate('1928-04-06'),
location : {
city: 'brattleboro',
state: 'vt',
county: 'windham',
country: 'usa',
coordinates : [42.51,72.34]},
contributor : ObjectId("4ee...00000"),
records: ObjectId("4ed8a...7b000000")
},
{
date : ISODate('1928-04-16'),
location : {
state: 'vt',
county: 'windham',
country: 'usa',
coordinates : [42.51,72.34]},
contributor : ObjectId("4ee...37bb"),
records: ObjectId("4eea...0000c8")
}]
} ]

Duplicate Events
events : [ {
birth : [ { date : ISODate('1928-04-06')},
{ date : ISODate('1928-04-16')}]
} ]

db.individual.find({"events.birth.date" : ISODate('1928-04-16')})

Same Query
Works!!
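
The reason the same query works is that a dotted path fans out over any arrays it meets along the way. The sketch below imitates that behaviour in memory, using date strings instead of ISODate values to keep it self-contained; valuesAtPath and matchesPath are hypothetical helpers and say nothing about how the server is actually implemented.

```javascript
// Collect every value reachable by following `keys` through `value`,
// descending into each element whenever an array is encountered.
function valuesAtPath(value, keys) {
  if (value === undefined) return [];
  if (Array.isArray(value)) {
    // Apply the remaining path to every element of the array.
    return value.flatMap(item => valuesAtPath(item, keys));
  }
  if (keys.length === 0) return [value];
  if (value === null || typeof value !== "object") return [];
  return valuesAtPath(value[keys[0]], keys.slice(1));
}

// True when any value found at the dotted path equals `expected`.
function matchesPath(doc, path, expected) {
  return valuesAtPath(doc, path.split(".")).includes(expected);
}

// One birth event stored as a single sub-document...
const single = { events: [ { birth: { date: "1928-04-16" } } ] };
// ...and the same person with two competing birth dates stored as an array.
const duplicated = { events: [ { birth: [ { date: "1928-04-06" },
                                          { date: "1928-04-16" } ] } ] };

console.log(matchesPath(single, "events.birth.date", "1928-04-16"));
console.log(matchesPath(duplicated, "events.birth.date", "1928-04-16"));
```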

Multiple Events
marriage : [{
date : ISODate('1939-08-11'),
end_date : ISODate('1940-02-19'),
to : ObjectId("4f297978aa999d9db02618cf"),
location : {
city: 'raleigh',
state: 'nc',
county: 'wake',
country: 'usa',
coordinates : [35.49,78.38]},
contributor : ObjectId("4eeac...91537bb")},
{
date : ISODate('1944-04-19'),
to : ObjectId("4f2978dfaa999d9db02618ce"),
location : {
city: 'atlanta',
state: 'ga',
county: 'fulton',
country: 'usa',
coordinates : [33.45,84.23]},
contributor : ObjectId("4eeb...37bb")}]

All together
individual = {
AFN : '1XYK-KQJ',
name: {
middle: 'peter',
},
events : [ {
birth : [
{
date : ISODate('1928-04-06'),
location : {
state: 'vt',
county: 'windham',
country: 'usa',
coordinates : [42.51,72.34]
},
contributor : ObjectId("4eeabc958b691537bb000000"),
records: ObjectId("4ed8aea7d8562f7d7b000000")
},
{
date : ISODate('1928-04-16'),
location : {

Records
record1 = {
_id : ObjectId("4ed8aea7d8562f7d7b"),
contributor : ObjectId("4eeab...1537bb"),
type : 'birth certificate',
thumbnail : BinData(0,"/9j/4AAQSkZJ...."),
content : BinData(0,"j6b/Id11lWqs..."),
tags : ['NY', 'certified'],
description : "John's birth certificate"
}

Users
user = {
_id : ObjectId("4eeabc958b691537bb"),
username : 'spf13',
email_address : 'genealogy@spf13.com',
password : 'a.long.passphrase18',
individual_id : ObjectId("4f2f...0ce"),
}

Scaling
MongoDB
for all the
generations

Replica Sets
Primary Primary Primary

Secondary Secondary Secondary

Secondary Arbiter Secondary

Secondary

Secondary

Sharding
App App App
Server Server Server
MongoS MongoS MongoS

ConfigD
ConfigD
ConfigD

MongoD MongoD MongoD MongoD



It’s not a tree at all,
It’s really a graph
... and an odd one at that

It would be easy if it
always looked like this

All sorts of mess
Step & adopted relationships
Duplicate nodes
Lots of missing nodes
Divorces and re-marriages
Multiple names for the same person
Multiple dates for the same event

Graphs are important

Without them we couldn’t store family relationships

Trees / graphs
in MongoDB
Since MongoDB data structures are
essentially objects, there is a good degree of
flexibility here.
Think of how you would structure them in
your application

Trees / graphs
in MongoDB
Each node is stored as a document

Contains references to related nodes

What is “related” depends on your
application

References vs
Relation
MongoDB uses references
Unlike foreign keys, references don’t
enforce integrity
Reference is really just a reference
For many applications a reference is
sufficient

Simple relationship
{ _id: "a" } { _id: "b" } { _id: "c" } { _id: "d" }
{ _id: "e", parents: ["a", "b" ]}
{ _id: "f", parents: ["c", "d" ]}
{ _id: "g", parents: ["e", "f" ]}

•Easy to access nodes in either direction
•Good for trees / graphs
•Can grab large sets
•Minimum amount of maintenance
•Balanced
•Implied relationships

//find direct descendants of b:
> db.family.find({ parents : 'b' })

//find all ancestors of g:
var g = db.family.findOne({ _id: 'g' });
ancestorFind = function(child) {
    var rv = [];
    if ( ! child.parents ) return rv;
    var parents = db.family.find( { _id : { $in : child.parents } } ).toArray();
    rv = rv.concat(parents);
    for (var i in parents) {
        rv = rv.concat( ancestorFind(parents[i]) );
    }
    return rv;
}
ancestorFind(g);

//find all descendants of b:
var b = db.family.find({ _id: 'b' }).toArray();
descendantsFind = function(par) {
    var rv = [];
    for (var i in par) {
        var k = db.family.find( { parents : par[i]._id } ).toArray();
        rv = rv.concat(k);
        rv = rv.concat(descendantsFind(k));
    }
    return rv;
}
descendantsFind(b);
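
The same two traversals can be tried without a database. This sketch keeps the seven example nodes in an in-memory Map and adds a visited set so that a person reachable through two lines (or bad data containing a cycle) is only reported once; ancestorsOf and descendantsOf are hypothetical names, and each Map lookup stands in for one query.

```javascript
// The seven example nodes, keyed by _id.
const family = new Map([
  ["a", { _id: "a" }],
  ["b", { _id: "b" }],
  ["c", { _id: "c" }],
  ["d", { _id: "d" }],
  ["e", { _id: "e", parents: ["a", "b"] }],
  ["f", { _id: "f", parents: ["c", "d"] }],
  ["g", { _id: "g", parents: ["e", "f"] }]
]);

// Walk upward through the parents arrays, collecting each ancestor id once.
function ancestorsOf(id, nodes, seen = new Set()) {
  const node = nodes.get(id);
  for (const parentId of (node && node.parents) || []) {
    if (!seen.has(parentId)) {
      seen.add(parentId);
      ancestorsOf(parentId, nodes, seen);
    }
  }
  return seen;
}

// Walk downward: a child is any node that lists `id` among its parents.
function descendantsOf(id, nodes, seen = new Set()) {
  for (const node of nodes.values()) {
    if ((node.parents || []).includes(id) && !seen.has(node._id)) {
      seen.add(node._id);
      descendantsOf(node._id, nodes, seen);
    }
  }
  return seen;
}

console.log(Array.from(ancestorsOf("g", family)).sort());
console.log(Array.from(descendantsOf("b", family)).sort());
```

Against a real collection each level of recursion costs at least one round trip, which is the trade-off the "Array of Ancestors" layout is meant to address.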

Bi-directional
{ _id: "a", children: ["e"] }
{ _id: "b", children: ["e"] }
{ _id: "c", children: ["f"] }
{ _id: "d", children: ["f"] }
{ _id: "e", children: ["g"], parents: ["a", "b" ]}
{ _id: "f", children: ["g"], parents: ["c", "d" ]}
{ _id: "g", children: [] , parents: ["e", "f"] }

•Doesn’t really add much beyond the first example
•More maintenance
•Duplication of each relationship
•Only real advantage is ability to grab all related
nodes (both directions) with one query.

Array of Ancestors
{ _id: "a" }
{ _id: "b" }
{ _id: "c" }
{ _id: "d" }
{ _id: "e", ancestors: [ "a", "b" ], parents: ["a", "b" ]}
{ _id: "f", ancestors: [ "c", "d" ], parents: ["c", "d" ]}
{ _id: "g", ancestors: [ "a", "b", "c", "d", "e", "f" ], parents: ["e", "f"] }

Great for small trees (or subsets).
Could be used to store X generations of ancestors
Optimized for retrieving entire tree
Uses implied relationships
No help on specifics... is this person my grandson?
Easier retrieval at expense of costlier maintenance

//find all descendants of b:
> db.tree.find({ ancestors: 'b' })
//find all direct descendants of b:
> db.tree.find({ parents: 'b' })
//find all ancestors of g:
> g = db.tree.findOne({ _id: 'g' })
> db.tree.find({ _id: { $in : g.ancestors } })
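
The "costlier maintenance" is the work of filling in the ancestors array whenever a node is added. One possible way to do that, assuming parents are always stored before their children, is to take the union of each parent's id and that parent's own ancestors. withAncestors is a hypothetical helper and the Map stands in for the collection.

```javascript
// Return a copy of `node` with an `ancestors` array derived from its parents.
// Assumes every parent is already present in nodesById with its own ancestors.
function withAncestors(node, nodesById) {
  const ancestors = new Set();
  for (const parentId of node.parents || []) {
    ancestors.add(parentId);
    const parent = nodesById.get(parentId);
    for (const id of (parent && parent.ancestors) || []) {
      ancestors.add(id);
    }
  }
  return Object.assign({}, node, { ancestors: Array.from(ancestors).sort() });
}

const tree = new Map();
for (const id of ["a", "b", "c", "d"]) tree.set(id, { _id: id });

// Insert in generation order so each parent's ancestors are known.
for (const node of [
  { _id: "e", parents: ["a", "b"] },
  { _id: "f", parents: ["c", "d"] },
  { _id: "g", parents: ["e", "f"] }
]) {
  tree.set(node._id, withAncestors(node, tree));
}

console.log(tree.get("g").ancestors);
```

If a parent is attached to an existing person later, every descendant's array has to be recomputed, which is where the maintenance cost shows up.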

Relations (basic)
{ _id : "b",
relations : [
{
id : "a",
relation : "parent"},
{
id : "c",
relation : "grandparent"},
{
id : "d",
relation : "parent"}]}

Relations (detailed)
{ _id : "b",
relations : [
{
id : "a",
relation : "parent",
type : "mother",
subtype : "biological" },
{
id : "c",
type : "father",
subtype : "adopted"},
{
id : "d",
type : "father",
subtype : "biological"}]}
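
With typed relations stored on the node, questions such as "who is the biological father?" become a filter over one document. A small in-memory sketch follows; relationsMatching is a hypothetical helper, and the sample document copies the fields shown above.

```javascript
// Return the relations whose fields equal every key/value pair in `criteria`.
function relationsMatching(person, criteria) {
  return (person.relations || []).filter(rel =>
    Object.keys(criteria).every(key => rel[key] === criteria[key]));
}

const personB = {
  _id: "b",
  relations: [
    { id: "a", relation: "parent", type: "mother", subtype: "biological" },
    { id: "c", type: "father", subtype: "adopted" },
    { id: "d", type: "father", subtype: "biological" }
  ]
};

console.log(relationsMatching(personB, { type: "father", subtype: "biological" }));
console.log(relationsMatching(personB, { type: "father" }).map(rel => rel.id));
```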

Shouldn’t I store my
family tree in a graph
database?
They are built to store trees after all

Graphs are great at
traversing deep in a tree

• Is this node my
relative?

• Retrieve my paternal
great, great, great,
great grandpa

Unfortunately that’s not
how we commonly work
Typically we are working with a node and
its immediate neighbors
The significant majority of our operations
aren’t traversing

If those operations are
important, perhaps a
hybrid graph & document
solution makes sense

http://spf13.com
http://github.com/s
@spf13

Question
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com
