SlideShare a Scribd company logo
1 of 58
Benchmarking
No-SQL Databases

              Alex Voss
             Chris Choi
TOP 2 Questions

  Should a social scientist buy
  MORE or UPGRADE
  computers?

  Which   DATABASE(s)?
Document Oriented
    Databases
The central concept of a
document-oriented database
is the notion of a Document
Both

          and


Uses JSON to store these
Documents
Example Documents:
                           {
{
                               FirstName:"Jonathan",
    FirstName:"Bob",
                               Address:"15 Wanamassa Point Road",
    Address:"5 Oak St.",
                               Children:[
    Hobby:"sailing"
                                   {Name:"Michael",Age:10},
}
                                   {Name:"Jennifer", Age:8},
                                   {Name:"Samantha", Age:5},
                                   {Name:"Elena", Age:2}
                               ]
                           }



there are no empty 'fields', and it does
not require explicitly stating if other
pieces of information are left out
The Database Tests
Database Test
Aggregate/Count:

   Most User Mentioned

   Most Hashed Tags

   Most Shared URLs
Database Test
Aggregate/Count:      Typical questions


   Most User Mentioned

   Most Hashed Tags

   Most Shared URLs
Has two query
             approaches:



                 Aggregation
Map Reduce   &
                 Framework
Map Reduce


     For   Aggregation

     Uses   JavaScript
Aggregation Framework


   Similar to Map Reduce

   But Uses   C++   (Faster!)
Single Node
                       SSD vs HDD

     Map Reduce                         Aggregation Framework
60                                 60
             Most User Mentioned
             Most Hashed-Tags                         Most User Mentioned
50                                 50                 Most Hashed-Tags
             Most Shared URLs
                                                      Most Shared URLs
40                                 40


30                                 30


20                                 20


10                                 10


0                                  0
     HDD      SSD                          HDD       SSD
Single Node


Aggregation Framework


    is roughly   2 times faster   than


                             Map Reduce
When we have more than 1
    node ~ Scaling

              Generally . . .
Scaling
                        BIGGER
Vertically              and
                        FATTER
                        machine!




Scaling
Horizontally
               MORE machines!
Replica-Set
Master-Slave
Replication



Replicates data
from master to
slave

                    Only one node is in
                    charge (the Master), and
                    data only travels in one
                    direction (to the Slave)
Replica-Set
                  Single Node vs Replica Set:
                   Aggregation Framework

                 25
                                                      Single Node
                                                      Replica Set
                 20



                 15
Time (minutes)




                 10



                 5



                 0
                      User Mentioned   Hashed-Tags   Shared URLs
Replica-Set
                  Single Node vs Replica Set:
                          Map Reduce

                 80
                                                     Single Node
                 70                                  Replica Set

                 60

                 50
Time (minutes)




                 40

                 30

                 20

                 10

                 0
                      User Mentioned   Hashed-Tags   Shared URLs
Replicating Data did not improve
        query performance
(MapReduce and Aggregation Framework) . . .
             so we tried
               Sharding
Sharding

                       One BIG stack




    Two or more smaller stacks
democratic
               uses a
               hierarchical scaling
               approach


Master-Slave
Replication
But Deployment is an
issue with MongoDB . . .



If you want to shard . . .
documentation suggests
             minimum     11     nodes, to build a

             fail-safe   sharded cluster

2 Shards → 6 nodes (3 each)

2 mongos (load balancer) → 2 nodes

3 mongod (config servers) → 3 nodes
-------------------------------------------

                     Total:   11 nodes
The investment to go from

single node -->   sharded cluster

is very HIGH!
By omitting Fail-Safe features




For our tests, we kind of   cheated and used:
      Note: X is variable
    2 X Shards → 6 2 nodes

    2 1 mongos (load balancer) → 2 0.5 node

    3 1 mongod (config servers) → 3 0.5 node
    -------------------------------------------

                             Total:   11 3 nodes
                                      NOT RECOMMENDED
                                      IN PRODUCTION
Which looks something like this:




             4 Shard configuration
Or this:




           8 Shard configuration
Another problem is with
querying in a sharded
environment . . .
We couldn't perform queries using
the Aggregation Framework in the
shards, because it was limited by:

- Output at every stage of the
aggregation can only contain     16MB
-   >10% RAM usage

                                 MongoDB would
                                 fail if the limits
                                 were reached
4 Cores
                                                                                 8GB RAM
                            Sharded Cluster
Map Reduce: On 3 modest machines

                                                     Most hash tags (mins)
                    30.00                            Most user mentions (mins)
                                                     Most shared urls (mins)

                    25.00


                    20.00
   Time (minutes)




                    15.00


                    10.00


                     5.00


                     0.00
                              2         4                8

                                                                                Most
                                  Number of Shards


                                                                               Optimal
4 Cores
                                                                                                      8GB RAM
                                   Sharded Cluster
                             Map Reduce: On 3 modest machines

                      Single Node                                                     Three Nodes
                                                                          60.00
                 60
                                                                                                            Most hash tags (mins)
                                   Most Hashed-Tags                                                         Most user mentions (mins)
                                                                          50.00
                 50                Most User Mentioned                                                      Most shared urls (mins)
                                   Most Shared URLs
                                                                          40.00
                 40


                                                         Time (minutes)
Time (minutes)




                 30                                                       30.00


                 20                                                       20.00


                 10                                                       10.00


                 0                                                         0.00
                       HDD            SSD                                         2            4                   8
                                                                                         Number of Shards
4 Cores
                                                                                                                    8GB RAM
                                                Sharded Cluster
                                          Map Reduce: On 3 modest machines

                                   Single Node                                                      Three Nodes
                                                                                        60.00
                      60 60
                                                 Most Hashed-Tags                                                         Most hash tags (mins)
                                                Most Hashed-Tags
                                                 Most User Mentioned                                                      Most user mentions (mins)
                                                                                        50.00
                      50 50                     Most User Mentioned
                                                 Most Shared URLs                                                         Most shared urls (mins)
                                                Most Shared URLs
                      40 40                                                             40.00



                                                                       Time (minutes)
Time (minutes)
                 Time (minutes)




                      30 30                                                             30.00


                      20 20                                                             20.00


                      10 10                                                             10.00


                             0 0                                                         0.00
                                   HDD
                                    HDD            SSD
                                                  SSD                                           2            4                   8
                                                                                                       Number of Shards
Performance of Single Node (SSD)
is very comparable to 3 nodes (HDD)
of different number of shards!
                    (Single Node with SSD is
                    2-3 minutes slower than 3
                    nodes . . . [40 GB corpus])



Perhaps its worth purchasing a good
SSD over investing more nodes to
improve aggregate performance . . .
What happens when we scale vertically

with   ONE BIG BOX?

                           we get --->
32 Cores
                                                                                        128GB RAM
                              Sharded Cluster
                         Map Reduce: On a Big Box

                 12.00
                                                            Most hash tags (mins)
                                                            Most user mentions (mins)     Most
                 10.00
                                                            Most shared urls (mins)
                                                                                         Optimal
                  8.00
Time (minutes)




                  6.00


                  4.00


                  2.00


                  0.00
                          2     4   8         16       24        28          30
                                    Number of Shards
On a   SINGLE NODE , while
                CouchBase
           is   FASTER than
        MongoDB's MapReduce

MongoDB's Aggregation Framework
           is   FASTER than
                CouchBase
That said adding more
Couchbase nodes was
  disappointing. . .
1        2   VS   3
           VS
      Node    Nodes    Nodes
                                                           One Node
90                                                         Two Nodes

                                                           Three Nodes
80


70


60


50


40


30


20


10


0
     Most user mentioned   Most hashed tags   Most shared urls
Adding more nodes did not
        improve
 query performance . . .
Other users have found similar issues,
Couchbase Support says:


 “One known issue with the current developer
preview of Couchbase is that it                                     slows down
on   smaller clusters. The views are optimal
      on   very large clusters.”                                                           [1]




  But they did not specify what is “very large”


       [1] Couchbase performance - real tests - it's horrible - or am I doing somethin wrong?
That was

November 2011 . . .
Which one?
The obvious answer is




From a performance standpoint.
Simple flat file aggregation with Java
took approx 60 mins

On the same machine, MongoDB took
approx

30 mins (MapReduce)
10 mins (Aggregation Framework)
Deployment Options
                                       140.00                                                                2500
                                             1 Node
                                                                                      Total Execution
Hard drive                                    HDD                                     Time              2200
Map Reduce                                     MR                                     Costs
                                       120.00
                                                                 1950                                        2000


                                       100.00




                                                                                                                    Approximate Total Cost (GBP)
         Total Execution Time (mins)




                                                      1 Node                                                 1500
                                        80.00          SSD
                                                        MR

                                        60.00                                  1000      1000
                                                      900                                                    1000


                                                               3 Nodes
                                        40.00650
                                                                 HDD            MR
                                                                 MR                                          500

                                        20.00                                             AF              Big Box
                                                                        1 Node #2                            HDD
                                         0.00                            Fast SSD                          0
                                                                                                             MR

                                                                                                 Aggregate
                                                                                                 Framework
But!

              Has problems with:

       - Full Text Search
       - Social Network Graphs
       - Aggregation
's Full Text Search
        is slow

1 simple query on a single
node, 44GB corpus took
approx
      10 mins!
Solution:




   A   Fast Full Text Search Engine
       The same search took only a
           fraction of a second
To create a
Social
Network
Graph with

  Can be Complex,
  since you query multiple
  times to obtain
  relationships
Solution:




   A   Popular Graph Database
        Where you can obtain a
           network graph
       with   VERY FEW queries
Example:



 START john=node:node_auto_index(name = 'John')
 MATCH john-[:friend]->()-[:friend]->fof
 RETURN john, fof


           The above query
           fetches all nodes who
           are friends of friends
           of “Johns”
With Aggregation




 Is perhaps not the best,
 we have yet to play with
Or even

          An SQL language to
          build MapReduce
          queries


With
Moral of the story



Use the right tools
for the right job
Future work
We aim to experiment with:




With possibility in integrating both
with

More Related Content

What's hot

Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and IterationsSameer Wadkar
 
Cours Big Data Chap6
Cours Big Data Chap6Cours Big Data Chap6
Cours Big Data Chap6Amal Abid
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Giuseppe Paterno'
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraFolio3 Software
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep DiveRed_Hat_Storage
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterRed_Hat_Storage
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...confluent
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastDataWorks Summit
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureDan McKinley
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 

What's hot (20)

Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Cours Big Data Chap6
Cours Big Data Chap6Cours Big Data Chap6
Cours Big Data Chap6
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
CouchDB Vs MongoDB
CouchDB Vs MongoDBCouchDB Vs MongoDB
CouchDB Vs MongoDB
 
Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache Cassandra
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
HDFS Selective Wire Encryption
HDFS Selective Wire EncryptionHDFS Selective Wire Encryption
HDFS Selective Wire Encryption
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on gluster
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds Architecture
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 

Viewers also liked

Couchbase Performance Benchmarking
Couchbase Performance BenchmarkingCouchbase Performance Benchmarking
Couchbase Performance BenchmarkingRenat Khasanshyn
 
Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?Ahmed Rashwan
 
Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0Dipti Borkar
 
Evaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingEvaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingSergey Bushik
 
Mythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDBMythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDBMongoDB
 
Introduction to Database Benchmarking with Benchmark Factory
Introduction to Database Benchmarking with Benchmark FactoryIntroduction to Database Benchmarking with Benchmark Factory
Introduction to Database Benchmarking with Benchmark FactoryMichael Micalizzi
 
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)DOCOMO Innovations, Inc.
 
MongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark
MongoDB 3.0.0 vs 2.6.x vs 2.4.x BenchmarkMongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark
MongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark承翰 蔡
 
MongoDB MapReduce Business Intelligence
MongoDB MapReduce Business IntelligenceMongoDB MapReduce Business Intelligence
MongoDB MapReduce Business IntelligenceShafaq Abdullah
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membaseArdak Shalkarbayuli
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking TutorialTilmann Rabl
 
Lightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at CogentaLightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at CogentaYann Cluchey
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB MongoDB
 

Viewers also liked (16)

Methods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarkingMethods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarking
 
Couchbase Performance Benchmarking
Couchbase Performance BenchmarkingCouchbase Performance Benchmarking
Couchbase Performance Benchmarking
 
Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?
 
Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0
 
Evaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingEvaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for Benchmarking
 
Mythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDBMythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDB
 
Introduction to Database Benchmarking with Benchmark Factory
Introduction to Database Benchmarking with Benchmark FactoryIntroduction to Database Benchmarking with Benchmark Factory
Introduction to Database Benchmarking with Benchmark Factory
 
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)
CouchConf Tokyo DOCOMO Innovations Lunchtime Lightning Talk (English)
 
MongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark
MongoDB 3.0.0 vs 2.6.x vs 2.4.x BenchmarkMongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark
MongoDB 3.0.0 vs 2.6.x vs 2.4.x Benchmark
 
Database Hardware Benchmarking
Database Hardware BenchmarkingDatabase Hardware Benchmarking
Database Hardware Benchmarking
 
MongoDB MapReduce Business Intelligence
MongoDB MapReduce Business IntelligenceMongoDB MapReduce Business Intelligence
MongoDB MapReduce Business Intelligence
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
Lightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at CogentaLightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at Cogenta
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
 
MongoDB and hadoop
MongoDB and hadoopMongoDB and hadoop
MongoDB and hadoop
 

Similar to Benchmarking MongoDB and CouchBase

MongoDB Best Practices in AWS
MongoDB Best Practices in AWS MongoDB Best Practices in AWS
MongoDB Best Practices in AWS Chris Harris
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationSchubert Zhang
 
Balancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBalancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBen Stopford
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Scaling up java applications on windows
Scaling up java applications on windowsScaling up java applications on windows
Scaling up java applications on windowsJuarez Junior
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...In-Memory Computing Summit
 
2014 05-07-fr - add dev series - session 6 - deploying your application-2
2014 05-07-fr - add dev series - session 6 - deploying your application-22014 05-07-fr - add dev series - session 6 - deploying your application-2
2014 05-07-fr - add dev series - session 6 - deploying your application-2MongoDB
 
Top Technology Trends
Top Technology Trends Top Technology Trends
Top Technology Trends InnoTech
 
Hadoop on a personal supercomputer
Hadoop on a personal supercomputerHadoop on a personal supercomputer
Hadoop on a personal supercomputerPaul Dingman
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Community
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheDavid Grier
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL storeEdward Capriolo
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeEdward Capriolo
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeEdward Capriolo
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaCloudera, Inc.
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 

Similar to Benchmarking MongoDB and CouchBase (20)

MongoDB Best Practices in AWS
MongoDB Best Practices in AWS MongoDB Best Practices in AWS
MongoDB Best Practices in AWS
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
 
Balancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBalancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java Database
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Scaling up java applications on windows
Scaling up java applications on windowsScaling up java applications on windows
Scaling up java applications on windows
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
2014 05-07-fr - add dev series - session 6 - deploying your application-2
2014 05-07-fr - add dev series - session 6 - deploying your application-22014 05-07-fr - add dev series - session 6 - deploying your application-2
2014 05-07-fr - add dev series - session 6 - deploying your application-2
 
Top Technology Trends
Top Technology Trends Top Technology Trends
Top Technology Trends
 
Hadoop on a personal supercomputer
Hadoop on a personal supercomputerHadoop on a personal supercomputer
Hadoop on a personal supercomputer
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and Reporting
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL store
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL store
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL store
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 

Benchmarking MongoDB and CouchBase

  • 1. Benchmarking No-SQL Databases Alex Voss Chris Choi
  • 2. TOP 2 Questions Should a social scientist buy MORE or UPGRADE computers? Which DATABASE(s)?
  • 3. Document Oriented Databases
  • 4. The central concept of a document-oriented database is the notion of a Document
  • 5. Both and Uses JSON to store these Documents
  • 6. Example Documents: { { FirstName:"Jonathan", FirstName:"Bob", Address:"15 Wanamassa Point Road", Address:"5 Oak St.", Children:[ Hobby:"sailing" {Name:"Michael",Age:10}, } {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2} ] } there are no empty 'fields', and it does not require explicitly stating if other pieces of information are left out
  • 8. Database Test Aggregate/Count: Most User Mentioned Most Hashed Tags Most Shared URLs
  • 9. Database Test Aggregate/Count: Typical questions Most User Mentioned Most Hashed Tags Most Shared URLs
  • 10.
  • 11. Has two query approaches: Aggregation Map Reduce & Framework
  • 12. Map Reduce For Aggregation Uses JavaScript
  • 13. Aggregation Framework Similar to Map Reduce But Uses C++ (Faster!)
  • 14. Single Node SSD vs HDD Map Reduce Aggregation Framework 60 60 Most User Mentioned Most Hashed-Tags Most User Mentioned 50 50 Most Hashed-Tags Most Shared URLs Most Shared URLs 40 40 30 30 20 20 10 10 0 0 HDD SSD HDD SSD
  • 15. Single Node Aggregation Framework is roughly 2 times faster than Map Reduce
  • 16. When we have more than 1 node ~ Scaling Generally . . .
  • 17. Scaling BIGGER Vertically and FATTER machine! Scaling Horizontally MORE machines!
  • 18. Replica-Set Master-Slave Replication Replicates data from master to slave Only one node is in charge (the Master), and data only travels in one direction (to the Slave)
  • 19. Replica-Set Single Node vs Replica Set: Aggregation Framework 25 Single Node Replica Set 20 15 Time (minutes) 10 5 0 User Mentioned Hashed-Tags Shared URLs
  • 20. Replica-Set Single Node vs Replica Set: Map Reduce 80 Single Node 70 Replica Set 60 50 Time (minutes) 40 30 20 10 0 User Mentioned Hashed-Tags Shared URLs
  • 21. Replicating Data did not improve query performance (MapReduce and Aggregation Framework) . . . so we tried Sharding
  • 22. Sharding One BIG stack Two or more smaller stacks
  • 23. democratic uses a hierarchical scaling approach Master-Slave Replication
  • 24. But Deployment is an issue with MongoDB . . . If you want to shard . . .
  • 25. documentation suggests minimum 11 nodes, to build a fail-safe sharded cluster 2 Shards → 6 nodes (3 each) 2 mongos (load balancer) → 2 nodes 3 mongod (config servers) → 3 nodes ------------------------------------------- Total: 11 nodes
  • 26. The investment to go from single node --> sharded cluster is very HIGH!
  • 27. By omitting Fail-Safe features For our tests, we kind of cheated and used: Note: X is variable 2 X Shards → 6 2 nodes 2 1 mongos (load balancer) → 2 0.5 node 3 1 mongod (config servers) → 3 0.5 node ------------------------------------------- Total: 11 3 nodes NOT RECOMMENDED IN PRODUCTION
  • 28. Which looks something like this: 4 Shard configuration
  • 29. Or this: 8 Shard configuration
  • 30. Another problem is with querying in a sharded environment . . .
  • 31. We couldn't perform queries using the Aggregation Framework in the shards, because it was limited by: - Output at every stage of the aggregation can only contain 16MB - >10% RAM usage MongoDB would fail if the limits were reached
  • 32. 4 Cores 8GB RAM Sharded Cluster Map Reduce: On 3 modest machines Most hash tags (mins) 30.00 Most user mentions (mins) Most shared urls (mins) 25.00 20.00 Time (minutes) 15.00 10.00 5.00 0.00 2 4 8 Most Number of Shards Optimal
  • 33. 4 Cores 8GB RAM Sharded Cluster Map Reduce: On 3 modest machines Single Node Three Nodes 60.00 60 Most hash tags (mins) Most Hashed-Tags Most user mentions (mins) 50.00 50 Most User Mentioned Most shared urls (mins) Most Shared URLs 40.00 40 Time (minutes) Time (minutes) 30 30.00 20 20.00 10 10.00 0 0.00 HDD SSD 2 4 8 Number of Shards
  • 34. 4 Cores 8GB RAM Sharded Cluster Map Reduce: On 3 modest machines Single Node Three Nodes 60.00 60 60 Most Hashed-Tags Most hash tags (mins) Most Hashed-Tags Most User Mentioned Most user mentions (mins) 50.00 50 50 Most User Mentioned Most Shared URLs Most shared urls (mins) Most Shared URLs 40 40 40.00 Time (minutes) Time (minutes) Time (minutes) 30 30 30.00 20 20 20.00 10 10 10.00 0 0 0.00 HDD HDD SSD SSD 2 4 8 Number of Shards
  • 35. Performance of Single Node (SSD) is very comparable to 3 nodes (HDD) of different number of shards! (Single Node with SSD is 2-3 minutes slower than 3 nodes . . . [40 GB corpus]) Perhaps its worth purchasing a good SSD over investing more nodes to improve aggregate performance . . .
  • 36. What happens when we scale vertically with ONE BIG BOX? we get --->
  • 37. 32 Cores 128GB RAM Sharded Cluster Map Reduce: On a Big Box 12.00 Most hash tags (mins) Most user mentions (mins) Most 10.00 Most shared urls (mins) Optimal 8.00 Time (minutes) 6.00 4.00 2.00 0.00 2 4 8 16 24 28 30 Number of Shards
  • 38.
  • 39. On a SINGLE NODE , while CouchBase is FASTER than MongoDB's MapReduce MongoDB's Aggregation Framework is FASTER than CouchBase
  • 40. That said adding more Couchbase nodes was disappointing. . .
  • 41. 1 2 VS 3 VS Node Nodes Nodes One Node 90 Two Nodes Three Nodes 80 70 60 50 40 30 20 10 0 Most user mentioned Most hashed tags Most shared urls
  • 42. Adding more nodes did not improve query performance . . .
  • 43. Other users have found similar issues, Couchbase Support says: “One known issue with the current developer preview of Couchbase is that it slows down on smaller clusters. The views are optimal on very large clusters.” [1] But they did not specify what is “very large” [1] Couchbase performance - real tests - it's horrible - or am I doing somethin wrong?
  • 46. The obvious answer is From a performance standpoint.
  • 47. Simple flat file aggregation with Java took approx 60 mins On the same machine, MongoDB took approx 30 mins (MapReduce) 10 mins (Aggregation Framework)
  • 48. Deployment Options 140.00 2500 1 Node Total Execution Hard drive HDD Time 2200 Map Reduce MR Costs 120.00 1950 2000 100.00 Approximate Total Cost (GBP) Total Execution Time (mins) 1 Node 1500 80.00 SSD MR 60.00 1000 1000 900 1000 3 Nodes 40.00650 HDD MR MR 500 20.00 AF Big Box 1 Node #2 HDD 0.00 Fast SSD 0 MR Aggregate Framework
  • 49. But! Has problems with: - Full Text Search - Social Network Graphs - Aggregation
  • 50. 's Full Text Search is slow 1 simple query on a single node, 44GB corpus took approx 10 mins!
  • 51. Solution: A Fast Full Text Search Engine The same search took only a fraction of a second
  • 52. To create a Social Network Graph with Can be Complex, since you query multiple times to obtain relationships
  • 53. Solution: A Popular Graph Database Where you can obtain a network graph with VERY FEW queries
  • 54. Example: START john=node:node_auto_index(name = 'John') MATCH john-[:friend]->()-[:friend]->fof RETURN john, fof The above query fetches all nodes who are friends of friends of “Johns”
  • 55. With Aggregation Is perhaps not the best, we have yet to play with
  • 56. Or even An SQL language to build MapReduce queries With
  • 57. Moral of the story Use the right tools for the right job
  • 58. Future work We aim to experiment with: With possibility in integrating both with