Open source, high performance database




Introduction to NoSQL and MongoDB
Will LaForest
Senior Director of 10gen Federal
will@10gen.com
@WLaForest




                                         1
Timeline, 1965 to 2010 (events in approximate order):
• IBM’s IMS
• Codd publishes relational model paper in 1970
• SQL invented
• Oracle founded
• PCs gain traction
• Client Server
• WWW born
• Dynamic Web Content
• 3 tier architecture
• Web applications
• Brewer’s CAP born
• BigTable
• SOA
• Cloud Computing
• 10gen founded
• NoSQL Movement
• MongoDB released
                                                                                                  2
[Figure: a relation (table) with an attribute (column) and a tuple (row) labeled]
                               3
4
• Data stored in an RDBMS is very compact (disk was
  more expensive)
• SQL and the RDBMS made queries flexible, even with rigid
  schemas
• Rigid schemas help optimize joins and storage
• Massive ecosystem of tools, libraries and integrations
• Been around 40 years!




                                                           5
• Gartner uses the 3Vs to define Big Data
• Volume - Big/difficult/extreme volume is relative
• Variety
   –   Changing or evolving data
   –   Uncontrolled formats
   –   Does not easily adhere to a single schema
   –   Unknown at design time
• Velocity
   – High or volatile inbound data
   – High query and read operations
   – Low latency
                                                      6
VOLUME & NEW ARCHITECTURES
• Systems scaling horizontally, not vertically
• Commodity servers
• Cloud Computing

DATA VARIETY & VOLATILITY
• Extremely difficult to find a single fixed schema
• Don’t know data schema a priori

TRANSACTIONAL MODEL
• N x Inserts or updates
• Distributed transactions


                                                           7
• Non-relational databases have been around for a long time (MUMPS?)
• Modern NoSQL theory and offerings started in early
  2000s
• Modern usage of term introduced in 2009
• NoSQL = Not Only SQL
• A collection of very different products
• Alternatives to relational databases when they are a
  bad fit
• Motives
   – Horizontally scalable (commodity server/cloud computing)
   – Flexibility
                                                                8
• Value (Data) mapped to a key (think primary key)
• Some values are typed, some are just BLOBs
• Redis, MemcacheDB, Voldemort

Key          Value
“Will”       “@WLaForest”
“Chris”      12
“Robert”     BLOB
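
The key-to-value mapping above can be sketched with an in-memory map. This is only an illustration of the data model, not the API of Redis, MemcacheDB, or Voldemort; the class name `KeyValueStore` and its `put`/`get` methods are made up for the example.

```javascript
// Minimal in-memory key-value store (illustrative; all names are hypothetical).
class KeyValueStore {
  constructor() {
    this.data = new Map(); // key -> value, like a table with only a primary key
  }

  put(key, value) {
    this.data.set(key, value);
  }

  get(key) {
    // Lookups are by exact key only; the store cannot query inside values.
    return this.data.has(key) ? this.data.get(key) : null;
  }
}

const store = new KeyValueStore();
store.put("Will", "@WLaForest");                  // typed value: string
store.put("Chris", 12);                           // typed value: number
store.put("Robert", new Uint8Array([0xde, 0xad])); // opaque BLOB

console.log(store.get("Will"));   // @WLaForest
console.log(store.get("Nobody")); // null
```

The only access path is the key, which is why these stores are easy to distribute but cannot answer ad hoc queries on the value.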
                                                 9
•   Data stored on disk in a column oriented fashion
•   Predominantly hash based indexing
•   Data partitioned by range or consistent hashing
•   Google BigTable, HBase, Cassandra, Accumulo




                                                       10
• Key-Value Stores
   – Value (Data) mapped to a key (think primary)
• Big Table Descendants
   – Looks like a distributed multi-dimension map
   – Data stored in a column oriented fashion
   – Predominantly hash based indexing
• Document oriented stores
   – Data stored as either JSON or XML documents
• What do they all have in common?

                                                    11
• Not a database
• Map reduce on HDFS or other data source
• When you want to use it
  – Can’t use an index
  – Distributing custom algorithms
  – ETL
• Great for grinding through data
• Many NoSQL offerings have native map reduce
  functionality


                                                12
13
• 2007
  – Eliot Horowitz & Dwight Merriman tired of reinventing the
    wheel
  – 10gen founded
  – MongoDB Development begins
• 2009
  – Initial release of MongoDB
• 73M+ in funding
  – Funded by Sequoia, NEA, Union Square Ventures, Flybridge
    Capital


                                                                14
• Pre-production Subscriptions for MongoDB
   – Fixed cost = $30k
   – 6 month term, unlimited servers
• MongoDB Subscriptions
   – 32 month term, $4k per server
   – 24x7 support for development and production, 1 hour response time
   – Onboarding call and quarterly review
• Consulting
   – Cost is $300/hr
• Training
   – MongoDB Developer - $1500/student, 2 day course
   – MongoDB Administrator - $1500/student, 2 day course
   – MongoDB Essential - $2250/student, 3 day course
• MongoDB Monitoring Service
                                                                         15
• 4 servers with 2 processors each
• Total processors equals 8
• Oracle Enterprise Edition
   – $47,500 per processor plus maintenance
   – 8 x 47,500 = $380,000 + $76,000
   – Total Cost = $456,000
• MongoDB Subscription
   –   $4,000 per server (no processor count)
   –   No license fee or maintenance charge
   –   4 x 4,000 = $16,000
   –   Total Cost = $16,000
• MongoDB is $440,000 less expensive
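
The comparison above can be reproduced with a few lines of arithmetic. The prices are the ones on the slide; the 20% maintenance rate is an assumption inferred from $76,000 / $380,000, and the variable names are illustrative.

```javascript
// Inputs taken from the slide.
const servers = 4;
const processorsPerServer = 2;
const oraclePricePerProcessor = 47500;
const mongoPricePerServer = 4000;

// Assumption: maintenance is 20% of the license fee (inferred, not stated).
const maintenancePercent = 20;

const processors = servers * processorsPerServer;                     // 8
const oracleLicense = processors * oraclePricePerProcessor;           // 380000
const oracleMaintenance = (oracleLicense * maintenancePercent) / 100; // 76000
const oracleTotal = oracleLicense + oracleMaintenance;                // 456000

const mongoTotal = servers * mongoPricePerServer;                     // 16000
const difference = oracleTotal - mongoTotal;                          // 440000

console.log({ oracleTotal, mongoTotal, difference });
```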
                                                16
• #2 on Indeed’s Fastest Growing Jobs
• Jaspersoft BigData Index: “Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011.”
• Google Searches
• 451 Group: “MongoDB increasing its dominance”




                                                                                17
#2 ON INDEED’S FASTEST GROWING JOBS




                                      18
“MongoDB INCREASING ITS DOMINANCE”




                                     19
• Scale horizontally over commodity hardware
• RDBMSs are great, so keep what works
   – Rich data models
   – Ad hoc queries
   – Fully featured indexes
• What doesn’t distribute well?
   – Long running multi-row transactions
   – Joins
   – Both artifacts of the relational data model
• Do not homogenize programming interfaces
• Local storage first class citizen for DB storage
                                                     20
•   Data stored as documents (JSON)
•   Schema free
•   CRUD operations – (Create Read Update Delete)
•   Atomic document operations
•   Ad hoc Queries like SQL
    –   Equality
    –   Regular expression searches
    –   Ranges
    –   Geospatial
•   Secondary indexes
•   Sharding (sometimes called partitioning) for scalability
•   Replication – HA and read scalability
                                                               21
•   MongoDB does not need any defined data schema.
•   Every document could have different data!

    {name: "will", eyes: "blue", birthplace: "NY",
     aliases: ["bill", "la ciacco"], gender: "???", boss: "ben"}

    {name: "jeff", eyes: "blue", height: 72, boss: "ben"}

    {name: "brendan", aliases: ["el diablo"]}

    {name: "ben", hat: "yes"}

    {name: "matt", pizza: "DiGiorno", height: 72, boss: "555.555.1212"}




                                                                        22
Seek = 5+ ms          Read = really really fast

[Diagram: Post, Author, and Comment records at separate locations on disk]




                                                  23
[Diagram: a single Post document containing its Author and a list of Comments]




              24
RDBMS      MongoDB
Database   Database
Table      Collection
Row        Document




                        25
• We are running instances on the order of:
  -   100B objects
  -   50TB storage
  -   50K qps per server
  -   ~1200 servers




                                              26
• SSL
   – between client and server
   – Intra-cluster communication
• Authorization at the database level
   – Read Only/Read+Write/Administrator
• Security Roadmap (tentative)
   –   Pluggable authentication 2.4
   –   Auditing 2.4
   –   Cell level security 2.6
   –   Common Criteria certification

                                          27
28
var p = { author: "roger",
     date: new Date(),
     text: "Spirited Away",
     tags: ["Tezuka", "Manga"]}

> db.posts.save(p)




                                  29
>db.posts.find()

 { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
   author : "roger",
   date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
   text : "Spirited Away",
   tags : [ "Tezuka", "Manga" ] }

Notes:
 - _id is unique, but can be anything you’d like
                                                       30
Create index on any Field in Document

 // 1 means ascending, -1 means descending

 >db.posts.ensureIndex({author: 1})

 >db.posts.find({author: 'roger'})

 { _id     : ObjectId("4c4ba5c0672c685e5e8aabf3"),
   author : "roger",
   ... }


                                                     31
• Conditional Operators
  – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
  – $lt, $lte, $gt, $gte

  // find posts with any tags
  > db.posts.find( {tags: {$exists: true }} )

  // find posts matching a regular expression
  > db.posts.find( {author: /^rog*/i } )

  // count posts by author
  > db.posts.find( {author: 'roger'} ).count()
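
To make the query semantics concrete, here is a rough in-memory imitation of a few of these operators in plain JavaScript. It is a simplification and not MongoDB code: the helpers `matches` and `find` and the sample data are hypothetical, only a handful of operators are covered, and array fields are not matched element by element as the real server does.

```javascript
// Returns true when a document satisfies every condition in the query.
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    const value = doc[field];
    if (cond instanceof RegExp) {
      return typeof value === "string" && cond.test(value);
    }
    if (cond !== null && typeof cond === "object") {
      // Operator document such as { $exists: true } or { $gt: 3 }
      return Object.entries(cond).every(([op, arg]) => {
        switch (op) {
          case "$exists": return (value !== undefined) === arg;
          case "$gt": return value > arg;
          case "$lt": return value < arg;
          case "$ne": return value !== arg;
          case "$in": return arg.includes(value);
          default: throw new Error("unsupported operator: " + op);
        }
      });
    }
    return value === cond; // plain equality
  });
}

function find(docs, query) {
  return docs.filter((doc) => matches(doc, query));
}

const samplePosts = [
  { author: "roger", tags: ["Tezuka", "Manga"] },
  { author: "Rogelio" },
  { author: "fred", tags: [] },
];

console.log(find(samplePosts, { tags: { $exists: true } }).length); // 2
console.log(find(samplePosts, { author: /^rog/i }).length);         // 2
console.log(find(samplePosts, { author: "roger" }).length);         // 1
```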
                                                                   32
• $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit

> comment = { author: "Fred",
           date: new Date(),
           text: "Best Movie Ever"}

> db.posts.update( { _id: "..." },
              { $push: {comments: comment} } );



                                                               33
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "roger",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Spirited Away",
  tags : [ "Tezuka", "Manga" ],
  comments : [
   {
       author : "Fred",
       date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
       text : "Best Movie Ever"
   }
  ]}
                                                           34
// Index nested documents
> db.posts.ensureIndex( { "comments.author": 1 } )
> db.posts.find( { "comments.author": "Fred" } )

// Index on tags
> db.posts.ensureIndex( { tags: 1 } )
> db.posts.find( { tags: "Manga" } )

// geospatial index
> db.posts.ensureIndex( { "author.location": "2d" } )
> db.posts.find( { "author.location": { $near: [22, 42] } } )
                                                             35
36
•   Native Map/Reduce in JS in MongoDB
    –   Distributes across the cluster with good data locality
•   New aggregation framework
    –   Declarative (no JS required)
    –   Pipeline approach (like Unix ps -ax | tee processes.txt | more)
•   Hadoop
    –   Combine the indexing of MongoDB with the brute force parallelization
        of Hadoop
    –   Hadoop MongoDB connector
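
The map/reduce idea can be sketched in plain JavaScript as a single-process simulation. This is not MongoDB's `mapReduce` command: there, `map` reads the document from `this` and calls a global `emit`, whereas the hypothetical helper below passes both in as arguments. The sample data is made up.

```javascript
// Runs map over every document, groups emitted values by key,
// then reduces each group to a single value.
function simulateMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map(doc, emit);

  const output = {};
  for (const [key, values] of groups) output[key] = reduce(key, values);
  return output;
}

// Example job: count posts per tag.
const taggedPosts = [
  { author: "roger", tags: ["Tezuka", "Manga"] },
  { author: "fred", tags: ["Manga"] },
  { author: "ben", tags: [] },
];

const tagCounts = simulateMapReduce(
  taggedPosts,
  (doc, emit) => doc.tags.forEach((tag) => emit(tag, 1)),
  (key, values) => values.reduce((sum, n) => sum + n, 0)
);

console.log(tagCounts); // { Tezuka: 1, Manga: 2 }
```

In a real cluster the map phase runs next to the data on each shard, which is where the data locality mentioned above comes from.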




                                                                                 37
38
Pipeline operators: $project, $match, $limit, $skip, $unwind, $group, $sort

Sample document:
{
 title : "this is my title" ,
 author : "bob" ,
 posted : new Date () ,
 pageViews : 5 ,
 tags : [ "fun" , "good" , "fun" ] ,
 comments : [
      { author : "joe" , text : "this is cool" } ,
      { author : "sam" , text : "this is bad" }
 ],
 other : { foo : 5 }
}

Aggregation:
db.article.aggregate(
   { $project : {
       author : 1,
       tags : 1
   }},
   { $unwind : "$tags" },
   { $group : {
       _id : "$tags",
       authors : { $addToSet : "$author" }
   }}
);
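
To show what each stage of this pipeline does to the sample document, here is a rough plain-JavaScript imitation. The helpers `project`, `unwind`, and `groupAddToSet` are hypothetical stand-ins written for this example, not MongoDB functions, and they ignore details such as `_id` handling in `$project`.

```javascript
const articles = [{
  title: "this is my title",
  author: "bob",
  pageViews: 5,
  tags: ["fun", "good", "fun"],
}];

// $project: keep only the listed fields.
const project = (docs, fields) =>
  docs.map((d) =>
    Object.fromEntries(fields.filter((f) => f in d).map((f) => [f, d[f]])));

// $unwind: emit one document per element of the array field.
const unwind = (docs, field) =>
  docs.flatMap((d) => d[field].map((v) => ({ ...d, [field]: v })));

// $group with $addToSet: collect distinct values of setField per key.
const groupAddToSet = (docs, keyField, setField, outField) => {
  const groups = new Map();
  for (const d of docs) {
    if (!groups.has(d[keyField])) groups.set(d[keyField], new Set());
    groups.get(d[keyField]).add(d[setField]);
  }
  return [...groups].map(([key, set]) => ({ _id: key, [outField]: [...set] }));
};

const projected = project(articles, ["author", "tags"]);
const unwound = unwind(projected, "tags"); // 3 documents, one per tag entry
const result = groupAddToSet(unwound, "tags", "author", "authors");

console.log(JSON.stringify(result));
// [{"_id":"fun","authors":["bob"]},{"_id":"good","authors":["bob"]}]
```

The duplicate "fun" tag produces two unwound documents, but `$addToSet` keeps "bob" only once in the group.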



                                                                                              39
40
[Diagram: Input data → MAP → Intermediate data → REDUCE → Output data, shown for three parallel partitions (1, 2, 3)]
                                                              41
43
[Diagram: the driver sends writes to the primary and reads to the primary or either secondary; the primary replicates asynchronously to both secondaries]



                                            44
[Diagram: the primary no longer takes writes; the driver reads from the two secondaries]



                            45
[Diagram: automatic leader election promotes one secondary to primary; the driver writes to the new primary and reads from it and the remaining secondary]



                                               46
[Diagram: the former primary is now a secondary; the driver writes to the current primary and reads from all three members]



                             47
Additional Details


                           Clients




            MongoS         MongoS               MongoS                     Config


                                                                           Config

Key Range      Key Range             Key Range           Key Range
0..30          31..60                61..90              91.. 100          Config

 Primary        Primary               Primary             Primary



Secondary      Secondary             Secondary           Secondary



Secondary      Secondary             Secondary           Secondary



                                                                                    48
•   Sharding Details
•   Replica Set Details
•   Consistency Details
•   Common Deployment Scenarios
•   Citations




                                  49
50
Write
                 Primary


                 Secondary
         Read

                 Secondary
Driver




         Read



                             51
Write
                 Primary
Driver




         Read

                 Secondary


                 Secondary



                             52
1. Write
                    Primary


                    Secondary   1. Replicate
         2. Read

                    Secondary
Driver




         2. Read



                                               53
• Fire and forget
• Wait for error
• Wait for fsync
• Wait for journal sync
• Wait for replication




                          54
Driver           Primary
         write

                           apply in memory




                                             55
Driver             Primary
            write
         getLastError
                             apply in memory




                                               56
Driver              Primary
            write
         getLastError
                              apply in memory
           j:true
                              Write to journal




                                                 57
Driver             Primary
            write
         getLastError
                             apply in memory
          fsync:true


                             fsync




                                               58
Driver             Primary                     Secondary
            write
         getLastError
                             apply in memory
           w:2
                                 replicate




                                                           59
Value          Meaning
<n:integer>    Replicate to N members of replica set
“majority”     Replicate to a majority of replica set members
<m:modeName>   Use custom error mode name
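
The write-concern levels on the preceding slides map onto options of the legacy `getLastError` command. The sketch below only builds the command document for each level; the helper name `getLastErrorCommand` and the level labels are made up for this example, and in the shell the result would be passed to `db.runCommand(...)`.

```javascript
// Builds the getLastError command document for a given durability level.
// Returns null for fire-and-forget, where getLastError is simply not called.
function getLastErrorCommand(level, w) {
  switch (level) {
    case "fire-and-forget": return null;
    case "error":           return { getLastError: 1 };
    case "journal":         return { getLastError: 1, j: true };
    case "fsync":           return { getLastError: 1, fsync: true };
    case "replication":     return { getLastError: 1, w: w }; // w: number, "majority", or mode name
    default: throw new Error("unknown level: " + level);
  }
}

console.log(getLastErrorCommand("journal"));                 // { getLastError: 1, j: true }
console.log(getLastErrorCommand("replication", 2));          // { getLastError: 1, w: 2 }
console.log(getLastErrorCommand("replication", "majority")); // { getLastError: 1, w: 'majority' }
```

Each step down the list trades latency for durability: the driver waits for more work to finish before the write is acknowledged.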




                                            60
61
> db.runCommand( { shardcollection: "test.users",
          key: { email: 1 }} )
      {
          name: "Jared",
          email: "jsr@10gen.com"
      }
      {
          name: "Scott",
          email: "scott@10gen.com"
      }
      {
          name: "Dan",
          email: "dan@10gen.com"
      }


                                                    62
-∞   +∞




      63
-∞                                              +∞




     dan@10gen.com            scott@10gen.com


              jsr@10gen.com




                                                 64
Split!




-∞                                              +∞




     dan@10gen.com            scott@10gen.com


              jsr@10gen.com




                                                 65
This is a                   Split!        This is a
     chunk                                     chunk


-∞                                                          +∞




                 dan@10gen.com            scott@10gen.com


                          jsr@10gen.com




                                                             66
-∞                adam@10gen.com    1
adam@10gen.com    jared@10gen.com   1
jared@10gen.com   scott@10gen.com   1
scott@10gen.com   +∞                1


• Stored in the config servers
• Cached in mongos
• Used to route requests and keep cluster balanced
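
Routing with the chunk table can be sketched as a range lookup on the shard key. Everything below is a simulation: the shard names are hypothetical (the table above lists the same shard for every chunk), the helper names are made up, and each chunk is treated as including its lower bound and excluding its upper bound.

```javascript
// Sentinels for the open ends of the key space.
const MIN_KEY = Symbol("-inf");
const MAX_KEY = Symbol("+inf");

// Chunk table keyed on email; shard assignments are invented for illustration.
const chunkTable = [
  { min: MIN_KEY,           max: "adam@10gen.com",  shard: "shardA" },
  { min: "adam@10gen.com",  max: "jared@10gen.com", shard: "shardB" },
  { min: "jared@10gen.com", max: "scott@10gen.com", shard: "shardC" },
  { min: "scott@10gen.com", max: MAX_KEY,           shard: "shardD" },
];

function inChunk(key, chunk) {
  const aboveMin = chunk.min === MIN_KEY || key >= chunk.min;
  const belowMax = chunk.max === MAX_KEY || key < chunk.max;
  return aboveMin && belowMax;
}

// Returns the shards a query must visit: one if it has the shard key, all otherwise.
function targetShards(query) {
  if (typeof query.email === "string") {
    return [chunkTable.find((c) => inChunk(query.email, c)).shard];
  }
  return [...new Set(chunkTable.map((c) => c.shard))];
}

console.log(targetShards({ email: "dan@10gen.com" })); // [ 'shardB' ]
console.log(targetShards({ email: "jsr@10gen.com" })); // [ 'shardC' ]
console.log(targetShards({ name: "Jared" }).length);   // 4
```

A query that includes the shard key touches one shard; anything else has to be sent to all of them.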


                                                     69
mongos
                                                                          config
                                     balancer
                                                                          config
Chunks!
                                                                          config




  1   2    3    4     13   14   15   16         25   26   27   28    37   38   39   40

  5   6    7    8     17   18   19   20         29   30   31   32    41   42   43   44

  9   10   11   12    21   22   23   24         33   34   35   36    45   46   47   48


 Shard 1             Shard 2                Shard 3                 Shard 4



                                                                                         70
mongos
                                                                          config
                                     balancer
                                                                          config


                    Imbalance                                             config




 1   2    3    4

 5   6    7    8

 9   10   11   12     21   22   23   24         33   34   35   36    45   46   47   48


Shard 1             Shard 2                 Shard 3                 Shard 4



                                                                                         71
mongos
                                                                          config
                                     balancer
                                                                          config

                               Move chunk 1 to                            config
                               Shard 2




 1   2    3    4

 5   6    7    8

 9   10   11   12    21   22    23   24         33   34   35   36    45   46   47   48


Shard 1             Shard 2                 Shard 3                 Shard 4



                                                                                         72
mongos
                                                                         config
                                    balancer
                                                                         config

                                                                         config




 1   2    3    4

 5   6    7    8

 9   10   11   12    21   22   23   24         33   34   35   36    45   46   47   48


Shard 1             Shard 2                Shard 3                 Shard 4



                                                                                        73
mongos
                                                                         config
                                    balancer
                                                                         config
                                         Chunks 1,2, and 3
                                         have migrated
                                                                         config




               4

 5   6    7    8     1                         2                    3

 9   10   11   12    21   22   23   24         33   34   35   36    45   46   47   48


Shard 1             Shard 2                Shard 3                 Shard 4
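
The migration sequence on these slides can be imitated with a small loop: move one chunk at a time from the fullest shard to the emptiest until the gap is small enough. This is a sketch, not the real balancer; the function name `balance` and the `threshold` of 4 are assumptions chosen so the outcome matches the slides.

```javascript
// Moves chunks from the most loaded shard to the least loaded one until the
// difference in chunk counts is at most `threshold`. Mutates `shards`.
function balance(shards, threshold) {
  const moves = [];
  for (;;) {
    const names = Object.keys(shards);
    let from = names[0];
    let to = names[0];
    for (const name of names) {
      if (shards[name].length > shards[from].length) from = name;
      if (shards[name].length < shards[to].length) to = name;
    }
    if (shards[from].length - shards[to].length <= threshold) break;
    const chunk = shards[from].shift(); // migrate the first chunk
    shards[to].push(chunk);
    moves.push({ chunk, from, to });
  }
  return moves;
}

// The imbalanced cluster from the slides: Shard 1 holds chunks 1..12.
const cluster = {
  "Shard 1": Array.from({ length: 12 }, (_, i) => i + 1),
  "Shard 2": [21, 22, 23, 24],
  "Shard 3": [33, 34, 35, 36],
  "Shard 4": [45, 46, 47, 48],
};

const moves = balance(cluster, 4);
console.log(moves.map((m) => `chunk ${m.chunk} -> ${m.to}`));
// [ 'chunk 1 -> Shard 2', 'chunk 2 -> Shard 3', 'chunk 3 -> Shard 4' ]
```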




                                                                                        74
75
1. Query arrives at mongos
2. mongos routes query to a single shard
3. Shard returns results of query
4. Results returned to client

[Diagram: mongos in front of Shard 1, Shard 2, and Shard 3, with arrows numbered to match the steps]




                                                                   76
1. Query arrives at mongos
2. mongos broadcasts query to all shards
3. Each shard returns results for query
4. Results combined and returned to client

[Diagram: mongos in front of Shard 1, Shard 2, and Shard 3, with arrows numbered to match the steps]




                                                                        77
1. Query arrives at mongos
2. mongos broadcasts query to all shards
3. Each shard locally sorts results
4. Results returned to mongos
5. mongos merge sorts individual results
6. Combined sorted result returned to client

[Diagram: mongos in front of Shard 1, Shard 2, and Shard 3, with arrows numbered to match the steps]
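
The distributed merge sort steps can be sketched in plain JavaScript. This is a single-process simulation under simple assumptions: each shard is just an array of documents, and the function name `scatterGatherSorted` is hypothetical.

```javascript
// Simulates a sorted scatter-gather query across shards.
function scatterGatherSorted(shards, field) {
  // Steps 2-4: every shard sorts its own documents and returns them.
  const runs = shards.map((docs) =>
    [...docs].sort((a, b) => (a[field] < b[field] ? -1 : a[field] > b[field] ? 1 : 0)));

  // Step 5: merge the sorted runs by repeatedly taking the smallest head.
  const cursors = runs.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < runs.length; i++) {
      if (cursors[i] >= runs[i].length) continue; // this run is exhausted
      if (best === -1 || runs[i][cursors[i]][field] < runs[best][cursors[best]][field]) {
        best = i;
      }
    }
    if (best === -1) break; // all runs exhausted
    merged.push(runs[best][cursors[best]++]);
  }
  return merged; // Step 6: combined sorted result
}

const userShards = [
  [{ state: "NY" }, { state: "CA" }],
  [{ state: "TX" }, { state: "AL" }],
  [{ state: "FL" }],
];

console.log(scatterGatherSorted(userShards, "state").map((d) => d.state));
// [ 'AL', 'CA', 'FL', 'NY', 'TX' ]
```

Because each shard returns an already sorted run, the router only needs to compare the heads of the runs rather than re-sort everything.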




                                                                        78
Operation   Routing              Example
Inserts     Requires shard key   db.users.insert({name: "Jared", email: "jsr@10gen.com"})
Removes     Routed               db.users.remove({email: "jsr@10gen.com"})
            Scattered            db.users.remove({name: "Jared"})
Updates     Routed               db.users.update({email: "jsr@10gen.com"}, {$set: {state: "CA"}})
            Scattered            db.users.update({state: "FZ"}, {$set: {state: "CA"}})
                                                              79
Query type                Routing                  Example
By shard key              Routed                   db.users.find({email: "jsr@10gen.com"})
Sorted by shard key       Routed in order          db.users.find().sort({email: -1})
Find by non shard key     Scatter gather           db.users.find({state: "CA"})
Sorted by non shard key   Distributed merge sort   db.users.find().sort({state: 1})

                                                                   80
81
Data Center




     Primary   Secondary   Secondary




                                       82
Data Center




     Primary   Secondary   Secondary
                           hidden=true




                            backups




                                         83
Active Data Center                  Standby Data Center




  Primary            Secondary          Secondary
  priority = 1       priority = 1




                                                          84
West Coast DC   Central DC      East Coast DC




 Secondary       Primary           Secondary
                 priority = 1




                                                85
86
•   History of Database Management (http://bit.ly/w3r0dv)
•   EMC IDC Study (http://bit.ly/y1mJgJ)
•   Gartner & Big Data (http://bit.ly/xvRP3a)
•   SQL (http://en.wikipedia.org/wiki/SQL)
•   Database Management Systems
    (http://en.wikipedia.org/wiki/Dbms)
• Dynamo: Amazon’s Highly Available Key-value Store
    (http://bit.ly/A8F8oy)
• CAP Theorem (http://bit.ly/zvA6O6)
• NoSQL Google File System and BigTable
    (http://oreil.ly/wOXliP)
• NoSQL Movement whitepaper (http://bit.ly/A8RBuJ)
• Sample ERD diagram (http://bit.ly/xV30v)

                                                            87
88
• Impossible for a distributed computer system to
  simultaneously provide all three of the following
  guarantees
   – Consistency - All nodes see the same data at the same
     time.
   – Availability - A guarantee that every request receives a
     response about whether it was successful or failed.
   – Partition tolerance - No set of failures less than total
     network failure is allowed to cause the system to respond
     incorrectly



                                                                 89

No SQL- The Future Of Data StorageBethmi Gunasekara
 
DevNation Atlanta
DevNation AtlantaDevNation Atlanta
DevNation Atlantaboorad
 
How to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectHow to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectDATAVERSITY
 
Why Organizations are Looking at Alternative Database Technologies – Introduc...
Why Organizations are Looking at Alternative Database Technologies – Introduc...Why Organizations are Looking at Alternative Database Technologies – Introduc...
Why Organizations are Looking at Alternative Database Technologies – Introduc...DATAVERSITY
 
Introduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDBIntroduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDBAhmed Farag
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
Solr cloud the 'search first' nosql database extended deep dive
Solr cloud the 'search first' nosql database   extended deep diveSolr cloud the 'search first' nosql database   extended deep dive
Solr cloud the 'search first' nosql database extended deep divelucenerevolution
 
Introducing MongoDB into your Organization
Introducing MongoDB into your OrganizationIntroducing MongoDB into your Organization
Introducing MongoDB into your OrganizationMongoDB
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalramazan fırın
 
Webinar: How Banks Manage Reference Data with MongoDB
 Webinar: How Banks Manage Reference Data with MongoDB Webinar: How Banks Manage Reference Data with MongoDB
Webinar: How Banks Manage Reference Data with MongoDBMongoDB
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FSMongoDB
 
Big Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBig Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBrian Enochson
 

Similar to An Introduction to Big Data, NoSQL and MongoDB (20)

Anti-social Databases
Anti-social DatabasesAnti-social Databases
Anti-social Databases
 
NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social Web
 
NOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the CloudNOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the Cloud
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
DevNation Atlanta
DevNation AtlantaDevNation Atlanta
DevNation Atlanta
 
How to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectHow to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot Project
 
Why Organizations are Looking at Alternative Database Technologies – Introduc...
Why Organizations are Looking at Alternative Database Technologies – Introduc...Why Organizations are Looking at Alternative Database Technologies – Introduc...
Why Organizations are Looking at Alternative Database Technologies – Introduc...
 
MongoDB
MongoDBMongoDB
MongoDB
 
MongoDB
MongoDBMongoDB
MongoDB
 
Introduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDBIntroduction to NoSQL and MongoDB
Introduction to NoSQL and MongoDB
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 
Drop acid
Drop acidDrop acid
Drop acid
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
NoSQL
NoSQLNoSQL
NoSQL
 
Solr cloud the 'search first' nosql database extended deep dive
Solr cloud the 'search first' nosql database   extended deep diveSolr cloud the 'search first' nosql database   extended deep dive
Solr cloud the 'search first' nosql database extended deep dive
 
Introducing MongoDB into your Organization
Introducing MongoDB into your OrganizationIntroducing MongoDB into your Organization
Introducing MongoDB into your Organization
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-final
 
Webinar: How Banks Manage Reference Data with MongoDB
 Webinar: How Banks Manage Reference Data with MongoDB Webinar: How Banks Manage Reference Data with MongoDB
Webinar: How Banks Manage Reference Data with MongoDB
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FS
 
Big Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBig Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and Cassasdra
 

Recently uploaded

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

An Introduction to Big Data, NoSQL and MongoDB

  • 1. Open source, high performance database Introduction to NoSQL and MongoDB Will LaForest Senior Director of 10gen Federal will@10gen.com @WLaForest 1
  • 2. Timeline, 1965 to 2010 (figure; labels in approximate order): IBM’s IMS; Codd publishes relational model paper in 1970; SQL invented; Oracle founded; PCs gain traction; client-server; WWW born; dynamic Web content; 3-tier architecture; Web applications; Brewer’s CAP born; SOA; BigTable; cloud computing; 10gen founded; NoSQL movement; an item marked “released” 2
  • 3. Attribute Tuple Relation 3
  • 4. 4
  • 5. • Data stored in an RDBMS is very compact (disk was more expensive) • SQL and RDBMS made queries flexible with rigid schemas • Rigid schemas help optimize joins and storage • Massive ecosystem of tools, libraries and integrations • Been around 40 years! 5
  • 6. • Gartner uses the 3Vs to define Big Data • Volume - Big/difficult/extreme volume is relative • Variety – Changing or evolving data – Uncontrolled formats – Does not easily adhere to a single schema – Unknown at design time • Velocity – High or volatile inbound data – High query and read operations – Low latency 6
  • 7. VOLUME & NEW ARCHITECTURES • Systems scaling horizontally, not vertically • Commodity servers • Cloud Computing DATA VARIETY & VOLATILITY • Extremely difficult to find a single fixed schema • Don’t know data schema a-priori TRANSACTIONAL MODEL • N x Inserts or updates • Distributed transactions 7
  • 8. • Non-relational databases have been around for a long time (MUMPS?) • Modern NoSQL theory and offerings started in early 2000s • Modern usage of term introduced in 2009 • NoSQL = Not Only SQL • A collection of very different products • Alternatives to relational databases when they are a bad fit • Motives – Horizontally scalable (commodity server/cloud computing) – Flexibility 8
  • 9. • Value (data) mapped to a key (think primary key) • Some values are typed, some are just BLOBs • Redis, MemcacheDB, Voldemort • Example: “Will” → “@WLaForest”; “Chris” → 12; “Robert” → BLOB 9
  • 10. Data stored on disk in a column oriented fashion • Predominantly hash based indexing • Data partitioned by range or consistent hashing • Google BigTable, HBase, Cassandra, Accumulo 10
  • 11. • Key-Value Stores – Value (Data) mapped to a key (think primary) • Big Table Descendants – Looks like a distributed multi-dimension map – Data stored in a column oriented fashion – Predominantly hash based indexing • Document oriented stores – Data stored as either JSON or XML documents • What do they all have in common? 11
  • 12. • Not a database • Map reduce on HDFS or other data source • When you want to use it – Can’t use an index – Distributing custom algorithms – ETL • Great for grinding through data • Many NoSQL offerings have native map reduce functionality 12
  • 13. 13
  • 14. • 2007 – Eliot Horowitz & Dwight Merriman tired of reinventing the wheel – 10gen founded – MongoDB Development begins • 2009 – Initial release of MongoDB • 73M+ in funding – Funded by Sequoia, NEA, Union Square Ventures, Flybridge Capital 14
  • 15. • Pre-production Subscriptions for MongoDB – Fixed cost = $30k – 6 month term, unlimited servers • MongoDB Subscriptions – 32 month term, $4k per server – 24x7 support for development and production, 1 hour response time – Onboarding call and quarterly review • Consulting – Cost is $300/hr • Training – MongoDB Developer - $1500/student, 2 day course – MongoDB Administrator - $1500/student, 2 day course – MongoDB Essential - $2250/student, 3 day course • MongoDB Monitoring Service 15
  • 16. • 4 servers with 2 processors each • Total processors equals 8 • Oracle Enterprise Edition – $47,500 per processor plus maintenance – 8 x 47,500 = $380,000 + $76,000 – Total Cost = $456,000 • MongoDB Subscription – $4,000 per server (no processor count) – No license fee or maintenance charge – 4 x 4,000 = $16,000 – Total Cost = $16,000 • MongoDB is $440,000 less expensive 16
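The arithmetic on slide 16 can be checked with a short calculation. This is a sketch that uses only the figures quoted on the slide; the 20% maintenance rate is an assumption inferred from $76,000 / $380,000, and the function names are hypothetical.

```javascript
// Figures are taken from slide 16; the maintenance rate is inferred, not stated.
const servers = 4;
const processorsPerServer = 2;
const oraclePerProcessor = 47500;  // USD per processor, as quoted on the slide
const maintenancePercent = 20;     // assumption: 76,000 / 380,000
const mongoPerServer = 4000;       // USD per server, as quoted on the slide

function oracleCost() {
  const license = servers * processorsPerServer * oraclePerProcessor;
  const maintenance = (license * maintenancePercent) / 100;
  return license + maintenance;
}

function mongoCost() {
  return servers * mongoPerServer; // no per-processor charge
}

console.log(oracleCost());               // 456000
console.log(mongoCost());                // 16000
console.log(oracleCost() - mongoCost()); // 440000
```

The gap comes almost entirely from per-processor licensing: doubling the processors per server doubles the first total and leaves the second unchanged.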
  • 17. #2 on Indeed’s Fastest Growing Jobs Jaspersoft BigData Index Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011. 451 Group Google Searches “MongoDB increasing its dominance” 17
  • 18. #2 ON INDEED’S FASTEST GROWING JOBS 18
  • 19. “MongoDB INCREASING ITS DOMINANCE” 19
  • 20. • Scale horizontally over commodity hardware • RDBMSs are great, so keep what works – Rich data models – Ad hoc queries – Fully featured indexes • What doesn’t distribute well? – Long running multi-row transactions – Joins – Both artifacts of the relational data model • Do not homogenize programming interfaces • Local storage is a first-class citizen for DB storage 20
  • 21. Data stored as documents (JSON) • Schema free • CRUD operations – (Create Read Update Delete) • Atomic document operations • Ad hoc Queries like SQL – Equality – Regular expression searches – Ranges – Geospatial • Secondary indexes • Sharding (sometimes called partitioning) for scalability • Replication – HA and read scalability 21
  • 22. MongoDB does not need any defined data schema. • Every document could have different data! {name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], gender: "???", boss: "ben"} {name: "jeff", eyes: "blue", height: 72, boss: "ben"} {name: "brendan", aliases: ["el diablo"]} {name: "matt", pizza: "DiGiorno", height: 72, boss: 555.555.1212} {name: "ben", hat: "yes"} 22
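What "every document could have different data" means for application code can be sketched without a server. In the example below a plain array stands in for a collection, the documents are an abridged version of those on slide 22, and `fieldNames` and `findBy` are hypothetical helpers, not MongoDB APIs.

```javascript
// A plain array stands in for a collection; no MongoDB server is involved.
const people = [
  { name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], boss: "ben" },
  { name: "jeff", eyes: "blue", height: 72, boss: "ben" },
  { name: "brendan", aliases: ["el diablo"] },
  { name: "matt", pizza: "DiGiorno", height: 72 },
  { name: "ben", hat: "yes" },
];

// Collect the set of field names seen across all documents.
function fieldNames(docs) {
  const names = new Set();
  for (const doc of docs) {
    Object.keys(doc).forEach((key) => names.add(key));
  }
  return [...names].sort();
}

// Documents that lack the field simply do not match an equality test.
function findBy(docs, field, value) {
  return docs.filter((doc) => doc[field] === value);
}

console.log(fieldNames(people));
console.log(findBy(people, "boss", "ben").map((d) => d.name)); // [ 'will', 'jeff' ]
console.log(findBy(people, "height", 72).map((d) => d.name));  // [ 'jeff', 'matt' ]
```

No single document carries all eight fields, so readers of the data have to treat every field other than `name` as optional.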
  • 23. Seek = 5+ ms Read = really really fast Post Comment Author 23
  • 24. Post Author Comment Comment Comment Comment Comment 24
  • 25. RDBMS → MongoDB: Database → Database; Table → Collection; Row → Document 25
  • 26. • We are running instances on the order of: - 100B objects - 50TB storage - 50K qps per server - ~1200 servers 26
  • 27. • SSL – between client and server – Intra-cluster communication • Authorization at the database level – Read Only/Read+Write/Administrator • Security Roadmap (tentative) – Pluggable authentication 2.4 – Auditing 2.4 – Cell level security 2.6 – Common Criteria certification 27
  • 28. 28
  • 29. > var p = { author: "roger", date: new Date(), text: "Spirited Away", tags: ["Tezuka", "Manga"] } > db.posts.save(p) 29
  • 30. >db.posts.find() { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ] } Notes: - _id is unique, but can be anything you’d like 30
  • 31. Create index on any Field in Document // 1 means ascending, -1 means descending >db.posts.ensureIndex({author: 1}) >db.posts.find({author: 'roger'}) { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", ... } 31
  • 32. • Conditional Operators – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type – $lt, $lte, $gt, $gte // find posts with any tags > db.posts.find( {tags: {$exists: true }} ) // find posts matching a regular expression > db.posts.find( {author: /^rog*/i } ) // count posts by author > db.posts.find( {author: 'roger'} ).count() 32
  • 33. • $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit > comment = { author: "fred", date: new Date(), text: "Best Movie Ever" } > db.posts.update( { _id: "..." }, { $push: {comments: comment} } ); 33
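How a query document with conditional operators is evaluated can be sketched in plain JavaScript. The matcher below covers only a handful of the operators listed on slide 32, works on top-level fields, and ignores MongoDB's array-matching rules; `find`, `matchesField`, and the sample posts are hypothetical and run entirely in memory.

```javascript
// In-memory sketch of a few query operators; not MongoDB's implementation.
const operators = {
  $exists: (value, want) => (value !== undefined) === want,
  $ne: (value, arg) => value !== arg,
  $in: (value, list) => list.includes(value),
  $gt: (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $lt: (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
};

function matchesField(value, condition) {
  if (condition instanceof RegExp) {
    return typeof value === "string" && condition.test(value);
  }
  if (condition !== null && typeof condition === "object") {
    // Every operator in the condition object must hold.
    return Object.entries(condition).every(([op, arg]) => {
      if (!(op in operators)) throw new Error("unsupported operator: " + op);
      return operators[op](value, arg);
    });
  }
  return value === condition; // plain equality
}

function find(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, cond]) => matchesField(doc[field], cond))
  );
}

const posts = [
  { author: "roger", tags: ["Tezuka", "Manga"], views: 12 },
  { author: "Rogelio", views: 3 },
  { author: "fred", tags: [], views: 40 },
];

console.log(find(posts, { tags: { $exists: true } }).length);      // 2
console.log(find(posts, { author: /^rog/i }).length);              // 2
console.log(find(posts, { author: "roger" }).length);              // 1
console.log(find(posts, { views: { $gte: 10, $lt: 40 } }).length); // 1
```

Each field in the query is an implicit AND, which is why `$or` and `$nor` exist as separate operators in the slide's list.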
  • 34. { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ], comments : [ { author : "Fred", date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)", text : "Best Movie Ever" } ]} 34
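The effect of the update modifiers on a single document can be sketched in memory. The function below handles only `$set`, `$inc`, and `$push` on top-level fields; `applyUpdate` is a hypothetical helper, and the `views` and `featured` fields are made up for illustration.

```javascript
// In-memory sketch of three update modifiers; not MongoDB's implementation.
function applyUpdate(doc, update) {
  for (const [field, value] of Object.entries(update.$set || {})) {
    doc[field] = value;
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount; // a missing field starts at 0
  }
  for (const [field, item] of Object.entries(update.$push || {})) {
    if (doc[field] === undefined) doc[field] = []; // a missing field becomes an array
    if (!Array.isArray(doc[field])) throw new Error(field + " is not an array");
    doc[field].push(item);
  }
  return doc;
}

const post = { author: "roger", text: "Spirited Away", tags: ["Tezuka", "Manga"] };
const comment = { author: "fred", date: new Date(), text: "Best Movie Ever" };

applyUpdate(post, { $push: { comments: comment } });
applyUpdate(post, { $inc: { views: 1 }, $set: { featured: true } });

console.log(post.comments.length); // 1
console.log(post.views);           // 1
```

Because the comment is pushed into the post itself, the post and its comments change together in one document operation, which is the point of the embedded model shown on slide 34.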
  • 35. // Index nested documents > db.posts.ensureIndex( { "comments.author": 1 } ) > db.posts.find( { "comments.author": "Fred" } ) // Index on tags > db.posts.ensureIndex( { tags: 1 } ) > db.posts.find( { tags: "Manga" } ) // geospatial index > db.posts.ensureIndex( { "author.location": "2d" } ) > db.posts.find( { "author.location": { $near: [22, 42] } } ) 35
  • 36. 36
  • 37. • Native Map/Reduce in JS in MongoDB – Distributes across the cluster with good data locality • New aggregation framework – Declarative (no JS required) – Pipeline approach (like Unix ps -ax | tee processes.txt | more) • Hadoop – Intersect the indexing of MongoDB with the brute force parallelization of Hadoop – Hadoop MongoDB connector 37
  • 38. 38
  • 39. Pipeline operators: $project, $match, $limit, $skip, $unwind, $group, $sort. Sample document: { title: "this is my title", author: "bob", posted: new Date(), pageViews: 5, tags: ["fun", "good", "fun"], comments: [ { author: "joe", text: "this is cool" }, { author: "sam", text: "this is bad" } ], other: { foo: 5 } } Pipeline: db.article.aggregate( { $project: { author: 1, tags: 1 } }, { $unwind: "$tags" }, { $group: { _id: "$tags", authors: { $addToSet: "$author" } } } ); 39
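The three stages used on slide 39 can be walked through in plain JavaScript to see what each one does to the data. This is a sketch, not the aggregation framework itself: `project`, `unwind`, and `groupAddToSet` are hypothetical helpers, the second article is invented so that the grouping has something to combine, and the real `$project` would also keep `_id`.

```javascript
// Plain-JavaScript walk-through of the three stages on slide 39.
const articles = [
  { title: "this is my title", author: "bob", pageViews: 5, tags: ["fun", "good", "fun"] },
  { title: "another title", author: "joe", pageViews: 7, tags: ["fun"] }, // hypothetical
];

// $project: keep only the listed fields.
const project = (docs, fields) =>
  docs.map((doc) =>
    Object.fromEntries(fields.filter((f) => f in doc).map((f) => [f, doc[f]]))
  );

// $unwind: emit one copy of the document per array element.
const unwind = (docs, field) =>
  docs.flatMap((doc) => doc[field].map((item) => ({ ...doc, [field]: item })));

// $group with $addToSet: collect the distinct values of valueField per key.
function groupAddToSet(docs, keyField, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc[keyField])) groups.set(doc[keyField], new Set());
    groups.get(doc[keyField]).add(doc[valueField]);
  }
  return [...groups].map(([key, set]) => ({ _id: key, authors: [...set] }));
}

const projected = project(articles, ["author", "tags"]);
const unwound = unwind(projected, "tags");
const result = groupAddToSet(unwound, "tags", "author");

console.log(unwound.length);         // 4
console.log(JSON.stringify(result));
```

The duplicate "fun" tag on the first article produces two unwound rows, but the set semantics of `$addToSet` keep "bob" from appearing twice in the group.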
  • 40. 40
  • 41. Input data MAP Intermediate data REDUCE Output data 1 1 2 2 3 3 41
  • 42. Input data MAP Intermediate data REDUCE Output data 1 1 2 2 3 3 42
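The flow in the diagram (input data, map, intermediate data, reduce, output data) can be sketched in a single process. The `mapReduce` function and sample posts below are hypothetical and only show the shape of the computation; a real cluster runs the map and reduce steps in parallel across machines.

```javascript
// Single-process sketch of the map -> intermediate -> reduce flow.
function mapReduce(inputs, mapFn, reduceFn) {
  const intermediate = new Map(); // key -> list of emitted values
  for (const input of inputs) {
    mapFn(input, (key, value) => {
      if (!intermediate.has(key)) intermediate.set(key, []);
      intermediate.get(key).push(value);
    });
  }
  const output = {};
  for (const [key, values] of intermediate) {
    output[key] = reduceFn(key, values);
  }
  return output;
}

const taggedPosts = [
  { author: "roger", tags: ["Tezuka", "Manga"] },
  { author: "fred", tags: ["Manga"] },
  { author: "bob", tags: [] },
];

// Map emits (tag, 1) for every tag; reduce sums the ones per tag.
const tagCounts = mapReduce(
  taggedPosts,
  (post, emit) => post.tags.forEach((tag) => emit(tag, 1)),
  (key, values) => values.reduce((sum, n) => sum + n, 0)
);

console.log(tagCounts); // { Tezuka: 1, Manga: 2 }
```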
  • 43. 43
  • 44. Write Primary Read Asynchronous Secondary Replication Read Driver Secondary Read 44
  • 45. Primary Secondary Read Driver Secondary Read 45
  • 46. Primary Write Primary Automatic Read Leader Election Driver Secondary Read 46
  • 47. Secondary Read Write Read Primary Driver Secondary Read 47
  • 48. Additional Details Clients MongoS MongoS MongoS Config Config Key Range Key Range Key Range Key Range 0..30 31..60 61..90 91.. 100 Config Primary Primary Primary Primary Secondary Secondary Secondary Secondary Secondary Secondary Secondary Secondary 48
  • 49. Sharding Details • Replica Set Details • Consistency Details • Common Deployment Scenarios • Citations 49
  • 50. 50
  • 51. Write Primary Secondary Read Secondary Driver Read 51
  • 52. Write Primary Driver Read Secondary Secondary 52
  • 53. 1. Write Primary Secondary 1. Replicate 2. Read Secondary Driver 2. Read 53
  • 54. • Fire and forget • Wait for error • Wait for fsync • Wait for journal sync • Wait for replication 54
  • 55. Driver Primary write apply in memory 55
  • 56. Driver Primary write getLastError apply in memory 56
  • 57. Driver Primary write getLastError apply in memory j:true Write to journal 57
  • 58. Driver Primary write getLastError apply in memory fsync:true fsync 58
  • 59. Driver Primary Secondary write getLastError apply in memory w:2 replicate 59
  • 60. Value: Meaning • <n:integer>: Replicate to N members of replica set • “majority”: Replicate to a majority of replica set members • <m:modeName>: Use custom error mode name 60
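The first two `w` values in the table reduce to a count of acknowledgements. The sketch below assumes every replica set member counts toward the majority, which is a simplification; `requiredAcks` is a hypothetical helper and custom error modes are deliberately left out.

```javascript
// Sketch: how many members must acknowledge a write for a given w value.
// Assumption: every member of the replica set counts toward the majority.
function requiredAcks(w, memberCount) {
  if (w === "majority") {
    return Math.floor(memberCount / 2) + 1;
  }
  if (Number.isInteger(w) && w >= 0) {
    if (w > memberCount) throw new RangeError("w exceeds replica set size");
    return w;
  }
  throw new Error("custom error modes are not modelled in this sketch");
}

console.log(requiredAcks(2, 3));          // 2
console.log(requiredAcks("majority", 3)); // 2
console.log(requiredAcks("majority", 4)); // 3
console.log(requiredAcks("majority", 5)); // 3
```

A four-member set needs the same three acknowledgements as a five-member set, which is one reason odd member counts are the usual choice.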
  • 61. 61
  • 62. > db.runCommand( { shardcollection: “test.users”, key: { email: 1 }} ) { name: “Jared”, email: “jsr@10gen.com”, } { name: “Scott”, email: “scott@10gen.com”, } { name: “Dan”, email: “dan@10gen.com”, } 62
  • 63. -∞ +∞ 63
  • 64. -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 64
  • 65. Split! -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 65
  • 66. This is a Split! This is a chunk chunk -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 66
  • 67. -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 67
  • 68. Split! -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 68
  • 69. Chunk ranges (min to max: shard): -∞ to adam@10gen.com: 1; adam@10gen.com to jared@10gen.com: 1; jared@10gen.com to scott@10gen.com: 1; scott@10gen.com to +∞: 1 • Stored in the config servers • Cached in mongos • Used to route requests and keep cluster balanced 69
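Routing by chunk range amounts to a lookup in a table of key ranges. The sketch below reuses the boundaries from slide 69 but spreads the chunks over four hypothetical shards so the routing is visible; `findChunk` is an illustrative helper, not part of mongos.

```javascript
// Chunk boundaries follow slide 69; the shard assignments are hypothetical.
const chunks = [
  { min: null, max: "adam@10gen.com", shard: "shard1" },   // null min stands for -infinity
  { min: "adam@10gen.com", max: "jared@10gen.com", shard: "shard2" },
  { min: "jared@10gen.com", max: "scott@10gen.com", shard: "shard3" },
  { min: "scott@10gen.com", max: null, shard: "shard4" },  // null max stands for +infinity
];

// A chunk covers shard key values in the half-open range [min, max).
function findChunk(key) {
  return chunks.find(
    (c) => (c.min === null || key >= c.min) && (c.max === null || key < c.max)
  );
}

console.log(findChunk("dan@10gen.com").shard);   // shard2
console.log(findChunk("jsr@10gen.com").shard);   // shard3
console.log(findChunk("scott@10gen.com").shard); // shard4
console.log(findChunk("aaron@10gen.com").shard); // shard1
```

Because the ranges tile the whole key space from -∞ to +∞, every key lands in exactly one chunk.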
  • 70. mongos config balancer config Chunks! config 1 2 3 4 13 14 15 16 25 26 27 28 37 38 39 40 5 6 7 8 17 18 19 20 29 30 31 32 41 42 43 44 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 70
  • 71. mongos config balancer config Imbalance config 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 71
  • 72. mongos config balancer config Move chunk 1 to config Shard 2 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 72
  • 73. mongos config balancer config config 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 73
  • 74. mongos config balancer config Chunks 1,2, and 3 have migrated config 4 5 6 7 8 1 2 3 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 74
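The balancing sequence on slides 70 to 74 can be imitated with a simple loop that moves one chunk at a time from the fullest shard to the emptiest. This is a deliberately naive sketch, not MongoDB's balancer: the `balance` function and the threshold value are assumptions, and only the starting layout (12 chunks on Shard 1, 4 on each of the others) comes from slide 71.

```javascript
// Naive balancer sketch: move chunks from the fullest shard to the emptiest
// until their chunk counts differ by no more than `threshold`.
function balance(shards, threshold) {
  if (threshold < 1) throw new RangeError("threshold must be at least 1");
  const moves = [];
  for (;;) {
    const names = Object.keys(shards);
    let most = names[0];
    let least = names[0];
    for (const name of names) {
      if (shards[name].length > shards[most].length) most = name;
      if (shards[name].length < shards[least].length) least = name;
    }
    if (shards[most].length - shards[least].length <= threshold) return moves;
    const chunk = shards[most].shift(); // take the lowest-numbered chunk
    shards[least].push(chunk);
    moves.push({ chunk, from: most, to: least });
  }
}

// Starting layout from slide 71.
const cluster = {
  shard1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  shard2: [21, 22, 23, 24],
  shard3: [33, 34, 35, 36],
  shard4: [45, 46, 47, 48],
};

const moves = balance(cluster, 1);
console.log(moves[0]);     // { chunk: 1, from: 'shard1', to: 'shard2' }
console.log(moves.length); // 6
console.log(Object.values(cluster).map((c) => c.length)); // [ 6, 6, 6, 6 ]
```

The first move matches slide 72 (chunk 1 goes to Shard 2); after that the sketch spreads chunks across all three lighter shards, which differs from the single destination shown on slide 74.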
  • 75. 75
  • 76. 1. Query arrives at mongos 2. mongos routes query to a single shard 3. Shard returns results of query 4. Results returned to client (diagram: mongos, Shard 1, Shard 2, Shard 3) 76
  • 77. 1. Query arrives at mongos 2. mongos broadcasts query to all shards 3. Each shard returns results for query 4. Results combined and returned to client (diagram: mongos, Shard 1, Shard 2, Shard 3) 77
  • 78. 1. Query arrives at mongos 2. mongos broadcasts query to all shards 3. Each shard locally sorts results 4. Results returned to mongos 5. mongos merge sorts individual results 6. Combined sorted result returned to client (diagram: mongos, Shard 1, Shard 2, Shard 3) 78
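Step 5 on slide 78, where mongos merge sorts the per-shard results, is a k-way merge of lists that are already sorted. The sketch below shows that step in isolation; `mergeSorted` and the sample state codes are hypothetical.

```javascript
// Each shard returns its matches already sorted; the router only merges them.
function mergeSorted(shardResults, compare) {
  const positions = shardResults.map(() => 0); // next unread index per shard
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < shardResults.length; i++) {
      if (positions[i] >= shardResults[i].length) continue; // shard exhausted
      if (
        best === -1 ||
        compare(shardResults[i][positions[i]], shardResults[best][positions[best]]) < 0
      ) {
        best = i;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][positions[best]]);
    positions[best] += 1;
  }
}

const byState = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const perShard = [
  ["AK", "NY"],       // Shard 1, locally sorted
  ["CA", "TX"],       // Shard 2, locally sorted
  ["CA", "FL", "WA"], // Shard 3, locally sorted
];

console.log(mergeSorted(perShard, byState));
// [ 'AK', 'CA', 'CA', 'FL', 'NY', 'TX', 'WA' ]
```

The router never re-sorts whole result sets; it only compares the current head of each shard's list, so the work per returned document grows with the number of shards rather than the number of documents.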
  • 79. Inserts: requires shard key: db.users.insert({ name: "Jared", email: "jsr@10gen.com" }) • Removes: routed: db.users.remove({ email: "jsr@10gen.com" }); scattered: db.users.remove({ name: "Jared" }) • Updates: routed: db.users.update( { email: "jsr@10gen.com" }, { $set: { state: "CA" } } ); scattered: db.users.update( { state: "FZ" }, { $set: { state: "CA" } } ) 79
  • 80. By shard key: routed: db.users.find({ email: "jsr@10gen.com" }) • Sorted by shard key: routed in order: db.users.find().sort({ email: -1 }) • Find by non shard key: scatter gather: db.users.find({ state: "CA" }) • Sorted by non shard key: distributed merge sort: db.users.find().sort({ state: 1 }) 80
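The four cases on slide 80 depend only on whether the shard key appears in the query or in the sort. The sketch below encodes that decision table; `planFind` is a hypothetical helper, the shard key is assumed to be `email` as in the slide 62 example, and real query routing considers more than this.

```javascript
// Decision table from slide 80, assuming the shard key is "email".
const shardKey = "email";

function planFind(query, sort) {
  if (shardKey in query) return "routed";
  const sortFields = Object.keys(sort || {});
  if (sortFields.length === 0) return "scatter gather";
  return sortFields[0] === shardKey ? "routed in order" : "distributed merge sort";
}

console.log(planFind({ email: "jsr@10gen.com" })); // routed
console.log(planFind({}, { email: -1 }));          // routed in order
console.log(planFind({ state: "CA" }));            // scatter gather
console.log(planFind({}, { state: 1 }));           // distributed merge sort
```

Queries that name the shard key touch one shard; everything else fans out to all of them, which is why the choice of shard key shapes query cost.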
  • 81. 81
  • 82. Data Center Primary Secondary Secondary 82
  • 83. Data Center Primary Secondary Secondary hidden=true backups 83
  • 84. Active Data Center Standby Data Center Primary Secondary Secondary priority = 1 priority = 1 84
  • 85. West Coast DC Central DC East Coast DC Secondary Primary Secondary priority = 1 85
  • 86. 86
  • 87. • History of Database Management (http://bit.ly/w3r0dv) • EMC IDC Study (http://bit.ly/y1mJgJ) • Gartner & Big Data (http://bit.ly/xvRP3a) • SQL (http://en.wikipedia.org/wiki/SQL) • Database Management Systems (http://en.wikipedia.org/wiki/Dbms) • Dynamo: Amazon’s Highly Available Key-value Store (http://bit.ly/A8F8oy) • CAP Theorem (http://bit.ly/zvA6O6) • NoSQL Google File System and BigTable (http://oreil.ly/wOXliP) • NoSQL Movement whitepaper (http://bit.ly/A8RBuJ) • Sample ERD diagram (http://bit.ly/xV30v) 87
  • 88. 88
  • 89. • Impossible for a distributed computer system to simultaneously provide all three of the following guarantees – Consistency - All nodes see the same data at the same time. – Availability - A guarantee that every request receives a response about whether it was successful or failed. – Partition tolerance - No set of failures less than total network failure is allowed to cause the system to respond incorrectly 89

Editor's Notes

  1. Database requirements are changing because of i) volume, ii) type of data, iii) agile development, iv) new architectures, v) new apps
  2. Detailed explanation of sharding in optional slides