MongoDB Aggregation Performance
John Page – Principal Consulting Engineer
The Aggregation Framework
• What is it?
• When should I use it?
• What can it do and not do?
• When should I use it instead of an RDBMS?
What is the Aggregation Framework
• It’s a data transformation pipeline.
• It's ultimately a Turing-complete functional language.
• It’s SELECT AS JOIN GROUP BY HAVING.
• It’s a fun challenge to use.
What can it do, and what can it not do?
• It can read and examine documents and apply logic to them and
create new ones.
• Technically – it can do almost anything.
• Mine Bitcoins.
• Learn (in the AI sense).
• Emulate / Transpile SQL statements.
• Generate graphics.
• Run simulations.
• It can’t currently edit existing data in place.
When should I definitely use it?
• When the data’s in MongoDB and you don’t want to copy it.
• When you want to report on live data.
• When your application operations require more than find()
When should I use it versus my RDBMS?
• That’s a very good question.
Received Wisdom
• Conventional wisdom says RDBMS is just better
• Optimized for reporting.
Let us take a scenario
• You have a set of data
• You want to Report on it and Analyze it
• This data isn’t live – so we don’t need to worry about that.
• There may be a lot of it.
Our Test Data Set
• Large and Meaningful
Data Details
• Every medical practice in England
• 10+ years available month by month
• Quantity and cost of each item prescribed and number of scripts.
• 100+ million rows a year
Relational Schema
Document Model
The Hardware
• Centos 7 – on Amazon EC2
• 32GB RAM
• 4 CPU Cores
• Databases on 2000 IOPS 400GB Disk
• Temp files on 1200 IOPS 400GB Disk
In the Blue corner
• MySQL Version 8.0
• Out of the box defaults
• Cache (innodb_buffer_pool) set to 80% of RAM
• 3 Tables (13GB)
• Indexing as required
In the Green Corner
• MongoDB 4.0.3
• Cache set to default (50% of RAM minus 1 GB)
• 1 Collection
• 15 GB of BSON
• 5GB on disk due to Snappy.
Round 1
How much did the UK Spend in 2017?
select sum(cost)
From
prescriptions;
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
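The nested $sum reads oddly at first: the inner $sum is an array expression that totals the costs inside one document's prescriptions array, while the outer $sum is the $group accumulator that adds those per-document totals together. A minimal Python sketch of the same two-level sum, using made-up documents rather than the NHS data:

```python
# Two illustrative documents shaped like the prescriptions collection.
docs = [
    {"practice": "A1", "prescriptions": [{"cost": 2.5}, {"cost": 4.0}]},
    {"practice": "B2", "prescriptions": [{"cost": 1.5}]},
]

total = sum(                                      # outer $sum: the $group accumulator
    sum(p["cost"] for p in d["prescriptions"])    # inner $sum: totals one document's array
    for d in docs
)
print(total)  # 8.0
```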
The Result
£ 8,309,203,021.46
The other Result
select sum(cost)
From
prescriptions;
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
MySQL: 37 seconds (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 54 seconds (25% CPU, 0 IOPS, 0 MB/s)
Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Depth Matters
• More flexibility
1 Bob 3.5 18-7-1972 NULL
2 Sally 8.9 15-3-1984 “Magic”
Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
_id:int 1 name: str(3) "bob" size:double 3.5 when: date 18-7-1972
Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Hierarchy Matters
• Much more flexibility
_id:int 1 name: str(3) bob sizes: array(256) [
double 3.5,
double 10,
double 1.2,
double 99]
when: date 18-7-1972
What about an Index?
select sum(cost)
From
prescriptions;
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
MySQL (covering index applied): 21 seconds (vs. 37) (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 54 seconds (25% CPU, 0 IOPS, 0 MB/s)
But can’t MongoDB index cover too?
• Yes, but not when it's a multikey (array) index
• Each unique value is indexed only once
• So the index cannot recreate the array
Can we fix that?
• What if we flatten the data?
• Lots of redundancy
• Collection is now 200% larger
• Normalisation?
db.prescriptions.aggregate([
  {$unwind: "$prescriptions"},
  {$project: {_id: 0}},
  {$out: "tabular"}
])
db.tabular.createIndex({'prescriptions.cost': 1})
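A rough Python analogue of the $unwind + $project + $out step above: each element of the prescriptions array becomes its own top-level record, with _id dropped. The document below is illustrative, not taken from the data set.

```python
# One document shaped like the prescriptions collection.
doc = {"_id": 1, "practice": "A1",
       "prescriptions": [{"cost": 2.5}, {"cost": 4.0}]}

# $unwind: one output record per array element; $project: drop _id.
tabular = [
    {**{k: v for k, v in doc.items() if k != "_id"}, "prescriptions": p}
    for p in doc["prescriptions"]
]
print(tabular)  # two records, one per prescription
```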
Flat, wide data.
• 110 M Rows
• 51 GB as BSON
• 15 GB Compressed
• Not really tabular.
• 860 MB Index
• Prefix Compression
• Super space efficient
Query Performance when flattened.
MongoDB, no covering index: 509 seconds (vs 54) (6% CPU, 1700 IOPS, 30 MB/s)
MongoDB, with covering index: 509 seconds (6% CPU, 1700 IOPS, 30 MB/s)
Query Performance when flattened.
• That doesn’t look right.
"queryPlanner" : {
  "winningPlan" : {
    "stage" : "COLLSCAN",
    "direction" : "forward"
  }
}
Flat, wide data.
• Need to persuade aggregation to use the index
• Add a query ( cost > 0) or sort by cost at the start
• Still slower than the document model?
• Document model is efficient.
• This data is actually MOST of the database 110M Entries
• Imagine if our index was a small percentage of the data.
• Index compression has a cost when reading.
No index: 509 seconds (vs 54) (6% CPU, 1700 IOPS, 30 MB/s)
Index: 177 seconds (vs 54) (25% CPU, 0 IOPS, 0 MB/s)
Table Layout
RDBMS
• Lots of fixed size rows in a file
• Nice predictable layout
MongoDB
• Variable Length rows in a file
• Less predictable layout
Table Layout – The Truth
• RDBMS and MongoDB both store records in Trees
• Records are, in some ways, just like indexes.
Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Index on identity
• Can only walk the tree
• Slower to collection scan
• Less lock contention
Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Organised by Identity (int64)
• No links between leaves
• Slower to scan everything
• Much less lock contention
In-Document rollup.
• We have multiple data items in each document.
• Add summaries of cost in each document?
• Effectively free to maintain when you're updating anyway ($max, $min, $sum, $count).
• RDBMS equivalent has big cost.
• You need to know in advance, or add as needed
• Like an RDBMS index
• What if we index the in-document rollup?
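The rollup idea above can be sketched locally: alongside the prescriptions array, each document carries a precomputed total that is maintained at write time, so the report sums one scalar per document instead of walking the array. The field name `totalcost` below is illustrative, not from the deck.

```python
# Hedged sketch of an in-document rollup, maintained on each write.
def add_prescription(doc, cost):
    """Append to the array and update the rollup in the same write
    (in MongoDB this would be one update using $push and $inc)."""
    doc["prescriptions"].append({"cost": cost})
    doc["totalcost"] = doc.get("totalcost", 0) + cost

doc = {"practice": "A1", "prescriptions": [], "totalcost": 0}
add_prescription(doc, 2.5)
add_prescription(doc, 4.0)

# Report-time work per document is now O(1): read one field.
print(doc["totalcost"])  # 6.5
```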
MongoDB with in document roll-up.
No index on the rollup (IDI): 18 seconds (vs 54, or 21 in the RDBMS) (25% CPU, 0 IOPS, 0 MB/s)
Index on the rollup (IDI): 18 seconds (25% CPU, 0 IOPS, 0 MB/s)
MongoDB with in document roll-up.
No index on the rollup (IDI): 18 seconds (vs 54, or 21 in the RDBMS) (25% CPU, 0 IOPS, 0 MB/s)
Index on the rollup (IDI): 0.01 seconds (25% CPU, 0 IOPS, 0 MB/s)
So far…
• When Data fits in RAM
• RDBMS table scan is faster than MongoDB collection scan
• RDBMS index scan is faster than RDBMS table scan
• A large MongoDB index scan isn't the solution
• In document rollups beat RDBMS Index scan
• Index scan of in-document rollups is really quick.
Round 2
What if it wasn’t all about the CPU?
• Data Lakes and “Big Data”
• Limited by reading data from disk
• Limited by Parallelism
• New Experiment Time.
• Reduce RAM to much less than Data Size*
• Work with disk bound data.
• Still one CPU.
Table/Collection scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
MySQL: 157 seconds (3.5% CPU, 1253 IOPS, 103 MB/s)
MongoDB: 61 seconds (25% CPU, 650 IOPS, 103 MB/s)
Why is MongoDB faster?
MySQL
• Data Size = 15 GB
MongoDB
• Data Size = 5GB*
• Minimal decompression overhead
*Not ‘Big’ Data
Index scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
MySQL (index added): 41 seconds (25% CPU, 1020 IOPS, 103 MB/s)
MongoDB: 61 seconds (25% CPU, 650 IOPS, 103 MB/s)
Battle Royale
More Complex Queries
From RAM
• May still use disk for temp tables, storage etc.
• All tables and indexes fit in RAM
From disk
• Data does NOT fit in RAM
• Some indexes MAY be in RAM
• No indexes used for MongoDB
With Group BY (RAM)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
MySQL: 63 seconds; 30 seconds with index (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 60 seconds (25% CPU, 0 IOPS, 0 MB/s)
With Group BY (Disk)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
MySQL: 160 seconds (10% CPU, 1020 IOPS, 103 MB/s); 39 seconds with index (18% CPU, 1010 IOPS, 103 MB/s)
MongoDB: 63 seconds (24% CPU, 650 IOPS, 82 MB/s)
Top 10 practices by spend (RAM)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
  { $group: { _id: "$practice",
              spend: { $sum: { $sum: "$prescriptions.cost" }}}},
  { $sort: { spend: -1 }},
  { $limit: 10 }
]
MySQL: 75 seconds (10% CPU, 1020 IOPS, 103 MB/s); 51 seconds with index (18% CPU, 1010 IOPS, 103 MB/s)
MongoDB: 63 seconds (24% CPU, 650 IOPS, 82 MB/s)
Top 10 practices by spend (Disk)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
  { $group: { _id: "$practice",
              spend: { $sum: { $sum: "$prescriptions.cost" }}}},
  { $sort: { spend: -1 }},
  { $limit: 10 }
]
MySQL: 160 seconds (10% CPU, 1150 IOPS, 104 MB/s); 55 seconds with index (21% CPU, 724 IOPS, 77 MB/s)
MongoDB: 64 seconds (25% CPU, 650 IOPS, 82 MB/s)
£ per patient – JOIN and Group (RAM)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id" : "$practice",
    "perpatient" : { "$sum" :
        { "$divide" : [ {"$sum" : "$prescriptions.cost"}, "$numpatients" ] }
    }}},
{ "$sort" : { "perpatient" : -1 }},
{ "$limit" : 10 }
MySQL: 105 seconds (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 62 seconds (25% CPU, 0 IOPS, 0 MB/s)
£ per patient – JOIN and Group (Disk)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id" : "$practice",
    "perpatient" : { "$sum" :
        { "$divide" : [ {"$sum" : "$prescriptions.cost"}, "$numpatients" ] }
    }}},
{ "$sort" : { "perpatient" : -1 }},
{ "$limit" : 10 }
MySQL: 160 seconds (17% CPU, 1200 IOPS, 103 MB/s)
MongoDB: 62 seconds (24% CPU, 650 IOPS, 82 MB/s)
£/patient/county – nested SELECT (RAM)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
  {"$group" : {"_id" : { "county": "$address.county", "practice": "$practice"},
               "spend" : { "$sum" : {"$sum" : "$prescriptions.cost"}},
               "numpatients" : { "$avg" : "$numpatients"}}},
  {"$group" : { "_id" : "$_id.county",
                "spend" : { "$sum" : "$spend" },
                "patients" : {"$sum" : "$numpatients"}}},
  {"$addFields" : { "costperpatient" : { "$divide" : ["$spend", "$patients"] }}},
  {"$match" : { "patients" : { "$gt" : 100000}}},
  {"$sort" : { "costperpatient" : -1}},
  {"$limit" : 20}
])
MySQL: 160 seconds (24% CPU, 0 IOPS, 0 MB/s)
MongoDB: 66 seconds (24% CPU, 0 IOPS, 0 MB/s)
Result
Highest spend per patient:
County        | Spend (£M) | Patients | Per Patient (£)
LINCOLNSHIRE  | 122        | 699309   | 175
WIRRAL        | 25         | 149554   | 172
CO DURHAM     | 102        | 596638   | 171
CLEVELAND     | 75         | 462593   | 163
ISLE OF WIGHT | 25         | 144555   | 162
Lowest spend per patient:
County        | Spend (£M) | Patients | Per Patient (£)
BERKSHIRE     | 102        | 944538   | 108
MIDDLESEX     | 150        | 1469189  | 102
BRISTOL       | 14         | 145660   | 97
LONDON        | 522        | 5672564  | 92
LEEDS         | 9          | 122785   | 74
£/patient/county – nested SELECT (Disk)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
  {"$group" : {"_id" : { "county": "$address.county", "practice": "$practice"},
               "spend" : { "$sum" : {"$sum" : "$prescriptions.cost"}},
               "numpatients" : { "$avg" : "$numpatients"}}},
  {"$group" : { "_id" : "$_id.county",
                "spend" : { "$sum" : "$spend" },
                "patients" : {"$sum" : "$numpatients"}}},
  {"$addFields" : { "costperpatient" : { "$divide" : ["$spend", "$patients"] }}},
  {"$match" : { "patients" : { "$gt" : 100000}}},
  {"$sort" : { "costperpatient" : -1}},
  {"$limit" : 20}
])
MySQL: 220 seconds (21% CPU, 700 IOPS, 68 MB/s)
MongoDB: 67 seconds (23% CPU, 640 IOPS, 82 MB/s)
Most common drugs – REGROUP (RAM)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
[
  { "$unwind" : "$prescriptions" },
  { "$group" : {
      _id : "$prescriptions.bnfcode",
      name : { $max : "$prescriptions.name" },
      items : { $sum : "$prescriptions.nitems" }}},
  { "$sort" : { "items" : -1 }},
  { "$limit" : 10 }
]
MySQL: 300 seconds (23% CPU, 0 IOPS, 0 MB/s); 126 seconds with index (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 262 seconds (25% CPU, 0 IOPS, 0 MB/s)
Grouping Techniques
SQL
• Can take advantage of index ordering by the group key: all items with the same key arrive together, so it can process one group at a time. 1,1,1,1,1,2,2,2,2,3,3
• Uses a temp table and a sort when it can't.
MongoDB
• Does not take advantage of ordering (yet); maintains a data structure holding all values.
• Assumes you will want to group again further down the pipeline, so is optimised for that.
• Builds a tree (using disk) for the values.
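The two grouping strategies above can be sketched side by side. With input ordered by the group key, groups can be emitted one at a time with constant working state; without ordering, the grouper must hold an entry per distinct key. Toy data below, not the NHS set:

```python
from itertools import groupby

rows = [(1, 10.0), (1, 5.0), (2, 7.0), (2, 1.0), (3, 2.0)]  # (key, cost), pre-sorted by key

# "SQL-style": exploit the ordering, hold one group in memory at a time.
streamed = {k: sum(c for _, c in grp)
            for k, grp in groupby(rows, key=lambda r: r[0])}

# "MongoDB-style": assume no ordering, accumulate every key in a structure.
accumulated = {}
for k, c in rows:
    accumulated[k] = accumulated.get(k, 0) + c

print(streamed)  # {1: 15.0, 2: 8.0, 3: 2.0}
```

Both produce the same totals; the difference is that the second must keep all distinct keys in memory (or, as the slide notes, spill them to a tree on disk).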
Result
Omeprazole_Cap E/C 20mg Acid Reflux 23,292,184
Aspirin Disper_Tab 75mg 15,361,735
Paracet_Tab 500mg 14,562,514
Amlodipine_Tab 5mg Blood Pressure 14,416,079
Atorvastatin_Tab 20mg Cholesterol 13,152,079
Lansoprazole_Cap 30mg (E/C Gran) Acid Reflux 12,906,650
Simvastatin_Tab 40mg Cholesterol 12,760,343
Metformin HCl_Tab 500mg Diabetes 11,404,331
Salbutamol_Inha 100mcg (200 D) C Asthma 10,595,100
Levothyrox Sod_Tab 100mcg Thyroid 9,312,464
Most common drugs – REGROUP (Disk)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
[
  { "$unwind" : "$prescriptions" },
  { "$group" : {
      _id : "$prescriptions.bnfcode",
      name : { $max : "$prescriptions.name" },
      items : { $sum : "$prescriptions.nitems" }}},
  { "$sort" : { "items" : -1 }},
  { "$limit" : 10 }
]
MySQL: 1427 seconds (4% CPU, 1800 IOPS, 100 MB/s); 192 seconds with index (13% CPU, 520 IOPS, 56 MB/s)
MongoDB: 262 seconds (24% CPU, 180 IOPS, 23 MB/s)
Most Expensive Drugs - Result
Rivaroxaban_Tab 20mg Anticoagulant £100,007,025
Apixaban_Tab 5mg Anticoagulant £79,302,385
Fostair_Inh 100mcg/6mcg (120D) C Asthma £75,541,726
Tiotropium_Pdr For Inh Cap 18mcg COPD £66,348,167
Sitagliptin_Tab 100mg Diabetes £60,919,725
Symbicort_Turbohaler 200mcg/6mcg Asthma £44,314,887
Apixaban_Tab 2.5mg Anticoagulant £41,290,937
Ins Lantus SoloStar_100u/ml 3ml Diabetes £41,182,602
Ezetimibe_Tab 10mg Cholesterol £40,756,234
Linagliptin_Tab 5mg Diabetes £38,503,893
Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
              "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
              "name": {"$max": "$prescriptions.name"},
              "practicename": {"$max": "$address.name"},
              "nitems": {"$sum": "$prescriptions.nitems"},
              "numpatients": {"$max": "$numpatients"},
              "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
MySQL: 2250 seconds (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1489 seconds (25% CPU, 0 IOPS, 0 MB/s)
Results
DRUG                  | PRACTICE              | RATIO | NOTE
Methadone             | FULCRUM               | 297   | Rehab
Trazodone             | CARE HOMES MEDICAL    | 242   | Elderly care
Buprenorphine         | FULCRUM               | 233   |
Thickenup PDR         | CARE HOMES MEDICAL    | 174   |
Vitrex Nitrile Gloves | REETH MEDICAL         | 173   | Preference?
Ema Film Gloves       | NEW SPrintwells1      | 168   |
Pro D3 Cap            | PALFREY HEALTH CENTRE | 123   | Vitamin D?
Fultium D3 Cap        | MOHANTY               | 123   | Vitamin D
Results
DRUG                  | PRACTICE      | RATIO
Oxycodone             | Merseyside GP | 108
Tamiflu               | Lancashire GP | 56
Nicotine_Transdermal  | Dorset GP     | 33
Loperamide (Immodium) | Kent GP       | 16
Anomaly Detection – JOIN Derived (Disk)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
              "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
              "name": {"$max": "$prescriptions.name"},
              "practicename": {"$max": "$address.name"},
              "nitems": {"$sum": "$prescriptions.nitems"},
              "numpatients": {"$max": "$numpatients"},
              "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
MySQL: 7848 seconds (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1655 seconds (25% CPU, 170 IOPS, 24 MB/s)
Conclusions
• MongoDB is faster from disk when there are no indexes
• MongoDB is generally faster for more complex queries
• MongoDB fits the data-lake model nicely.
Is there a SQL BI Connector?
BI Connector
• Translating proxy server
• SQL -> MySQL wire protocol -> BI Connector -> MongoDB aggregation
BI Connector
• Total Spend.
• Sum one column
Times in seconds:
       SQL unindexed | SQL indexed | MongoDB aggregation | BI Connector
Disk   157           | 41          | 61                  |
RAM    37            | 21          | 54                  | 150
Why slower?
Hand-written aggregation:
{ $group :
  { _id: "all",
    t : { $sum :
          { $sum: "$prescriptions.cost" }
        }
  }
}
What the BI Connector generates:
{"$unwind": {"includeArrayIndex": "prescriptions_idx",
             "path": "$prescriptions"}},
{"$group": {"_id": {},
    "sum(nhs_DOT_prescriptions_DOT_cost)": {"$sum": "$cost"},
    "sum(nhs_DOT_prescriptions_DOT_cost)_count": {"$sum":
        {"$cond": [{"$eq": [{"$ifNull": ["$cost", null]}, null]}, 0, 1]}}}},
{"$project": {"_id": 0,
    "sum(nhs_DOT_prescriptions_DOT_cost)": {"$let": {
        "in": {"$cond": [
            {"$or": [
                {"$eq": [{"$ifNull": ["$$expr", null]}, null]},
                {"$eq": ["$$expr", 0]},
                {"$eq": ["$$expr", false]}]},
            {"$literal": null},
            "$sum(nhs_DOT_prescriptions_DOT_cost)"]},
        "vars": {"expr": "$sum(nhs_DOT_prescriptions_DOT_cost)_count"}}}}},
{"$project": {"nhs_DOT_sum(nhs_DOT_prescriptions_DOT_cost)":
                  "$sum(nhs_DOT_prescriptions_DOT_cost)",
              "_id": 0}}
Why so much slower - Unwind
The generated pipeline begins by unwinding every prescriptions array, rather than summing it in place:
{"$unwind": {"includeArrayIndex": "prescriptions_idx",
             "path": "$prescriptions"}},
Why so much slower – SQL standard semantics
MongoDB: $sum of [NULL, NULL, NULL] = 0
SQL: SUM(NULL, NULL, NULL) = NULL
The generated pipeline therefore counts the non-null values so it can return NULL when there are none:
{"$sum": {"$cond": [{"$eq": [{"$ifNull": ["$cost", null]}, null]}, 0, 1]}}
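The semantic gap above is why the BI Connector carries the extra count-and-$cond machinery. A minimal Python sketch of the two behaviours, with None standing in for NULL (the helper names are illustrative):

```python
def mongo_sum(values):
    """MongoDB $sum: non-numeric values are ignored; all-null input gives 0."""
    return sum(v for v in values if isinstance(v, (int, float)))

def sql_sum(values):
    """SQL SUM: NULLs are skipped, but an all-NULL input yields NULL."""
    nums = [v for v in values if v is not None]
    return sum(nums) if nums else None

print(mongo_sum([None, None, None]))  # 0
print(sql_sum([None, None, None]))    # None
```

To present SQL semantics on top of $sum, the connector must separately count the non-null inputs and replace the result with NULL when that count is zero, which is exactly what the generated $cond/$let stages do.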
BI Connector
• Total Spend By Period.
• Sum one column Group by Primary Key
Times in seconds:
       SQL unindexed | SQL indexed | MongoDB aggregation | BI Connector
Disk   160           | 39          | 63                  |
RAM    63            | 30          | 60                  | 187
BI Connector
• Total Spend By PRACTICE.
• Sum one column Group by PART OF KEY
Times in seconds:
       SQL unindexed | SQL indexed | MongoDB aggregation | BI Connector
Disk   160           | 55          | 62                  |
RAM    75            | 51          | 62                  | 230
BI Connector
• Total Spend By PATIENT.
• Sum one column Group BY JOINED FIELD
Times in seconds:
       SQL indexed | MongoDB aggregation | BI Connector
Disk   160         | 62                  |
RAM    105         | 62                  | 230
BI Connector
• AVG Spend By SPEND PER PATIENT PER COUNTY .
• Sum one column Group BY COMPUTED FIELD
Times in seconds:
       SQL indexed | MongoDB aggregation | BI Connector
Disk   220         | 67                  |
RAM    160         | 66                  | 220
BI Connector
• MOST prescribed drugs.
• Sum one column Group BY Not PK
Times in seconds:
       SQL unindexed | SQL indexed | MongoDB aggregation | BI Connector
Disk   1427          | 192         | 262                 |
RAM    300           | 126         | 262                 | 280
BI Connector
• Anomaly Detection.
• Sum one column Group from subquery Joined table
Times in seconds:
       SQL indexed | MongoDB aggregation | BI Connector
Disk   7848        | 1655                |
RAM    2250        | 1489                | D.N.F.!
BI Connector – Did Not Finish – why?
• Anomaly Detection.
• Did Run – but was taking a long time
• Using expressive $lookup for the join
$lookup: {
   from: "prescriptions",
   let: { drug : "aspirin" },
   pipeline: [
      // group total by drugname, practice
      // divide by practice size
      // group by drugname
      // match $$drug
   ],
   as: <output array field>
}
BI Connector – why Did Not Finish
• Anomaly Detection.
• Did run – but was taking a long time
• Using expressive $lookup for the join
• No Index on in-memory table
• Hand-written version
• used a temp collection
• made use of the fact that _id was the lookup field
$lookup: {
   from: "prescriptions",
   let: { drug : "aspirin" },
   pipeline: [
      // group total by drugname, practice
      // divide by practice size
      // group by drugname
      // match $$drug
   ],
   as: <output array field>
}
Conclusions
• The BI Connector is a little slower than the RDBMS for simple queries
• The more complicated the query, the better it does relative to the RDBMS.
• It’s not as quick as hand crafted Aggregation
• But you can put that in views.
• But it’s very convenient
• You can use your existing BI Tooling
• And you could always use Charts instead.
Is there an Elephant in the room?
External compute engines
• Spark
• Hadoop
• R
• Python
• Java / C
Pros
• Simpler to write much more complicated processing.
• Lots of libraries of pre written code
• Able to perform a lot of in-memory computation
• MongoDB can send them data very, very quickly
Cons
• Costs of transferring from or inside clouds
• Atlas
• AWS
• Network speed limitations.
• Additional hardware.
• Security considerations.
AWS same region: 1 cent / GB
AWS between regions: 9 cents / GB
AWS out to the internet: 11 cents / GB
So do I use Spa^HR^doop! Or not?
• Yes – those tools are great for many things
• But always push computation DOWN to MongoDB if you can
• There is a balance
• Amount of effort to write as a Pipeline
• Reduced network costs in time and money
Simple Example
• Pearson's rho
• Degree of correlation between two numeric lists
• Let's compare latitude (north vs south)
• and quantity of drug per person
• Hypothesis: "For some drugs, more is prescribed as you travel up or down the UK"
Step 1 - Geocoding
• We need to augment our records with Lat/Long
• Download a handy set of postcode centroids
• mongoimport --type csv --headerline -d nhs postcodes.csv
• Use $lookup and $out to attach to each record and make new collection.
Simple Geocoding
tidypc = {$addFields: { "address.postcode" : {$rtrim:{input:"$address.postcode"}}}}
Geocode = { $lookup :{
from: "postcodes",
localField: "address.postcode",
foreignField: "Postcode",
as: "location"
}}
firstelement = { $addFields : { location : { $arrayElemAt : [ "$location" , 0 ]} }}
choosefields = { $project : { practice:1, numpatients:1, lon: "$location.Longitude",
lat: "$location.Latitude","prescriptions.name":1,"prescriptions.nitems":1}}
Step 2 – Group by drug
• For each drug compute average quantity/10,000 patients per
surgery
• Group to one record per drug with an array of objects
{
  drug : "Aspirin",
  prescribed : [
    { where : [ -3.5, 55.2 ],
      per10k : 75.4 }
    …
  ]
}
Group by BNF code
unwind ={ $unwind:"$prescriptions"}
regroup = { $group : {
_id : "$prescriptions.name",
p : { $push : { where : ["$lon","$lat"] ,
per10k : { $multiply: [10000,
{$divide : [ "$prescriptions.nitems",
"$numpatients"]}]}}}}}
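The regroup step above can be sketched locally: one record per drug, carrying an array of (location, items-per-10,000-patients) pairs. The rows below are made up to mimic the post-geocode documents:

```python
# Toy post-geocode rows: drug, location, items prescribed, patients at the practice.
rows = [
    {"drug": "Aspirin", "lon": -3.5, "lat": 55.2, "nitems": 377, "numpatients": 50000},
    {"drug": "Aspirin", "lon": -0.1, "lat": 51.5, "nitems": 120, "numpatients": 40000},
]

grouped = {}
for r in rows:
    grouped.setdefault(r["drug"], []).append(      # $group with $push
        {"where": [r["lon"], r["lat"]],
         "per10k": 10000 * r["nitems"] / r["numpatients"]})  # $multiply / $divide

print(grouped["Aspirin"][0]["per10k"])  # 75.4
```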
Step 3 – Pearson's Rho
• Compute rho on the array, comparing per10k to latitude.
Computing Pearson's Rho
// note: this stage expects each array element to carry l (latitude) and t (per-10k quantity)
sumcolumns = {$addFields : { s : { $reduce : { input : "$p",
initialValue: {count:0,suml:0,sumt:0,sumlsquared:0,sumtsquared:0,sumtl:0},
in : {
count : { $add : [ "$$value.count", 1]},
suml : { $add : [ "$$value.suml", "$$this.l" ]},
sumt : { $add : [ "$$value.sumt", "$$this.t" ]},
sumlsquared : { $add : [ "$$value.sumlsquared", { $multiply : ["$$this.l" ,"$$this.l"]}]},
sumtsquared : { $add : [ "$$value.sumtsquared", { $multiply : ["$$this.t" ,"$$this.t"]}]},
sumtl : { $add : [ "$$value.sumtl", { $multiply : ["$$this.t" ,"$$this.l"]}]},
}
}}}}
Computing Pearson's Rho
multiply_suml_sumt = { $multiply : [ "$s.suml","$s.sumt"] }
multiply_sumtl_count = { $multiply : ["$s.sumtl","$s.count"]}
partone = { $subtract : [ multiply_sumtl_count, multiply_suml_sumt ]}
multiply_sumlsquared_count = { $multiply : ["$s.sumlsquared","$s.count"]}
suml_squared = { $multiply : ["$s.suml","$s.suml"]}
subparttwo = { $subtract : [ multiply_sumlsquared_count,suml_squared ]}
multiply_sumtsquared_count = { $multiply : ["$s.sumtsquared","$s.count"]}
sumt_squared = { $multiply : ["$s.sumt","$s.sumt"]}
subpartthree = { $subtract : [ multiply_sumtsquared_count,sumt_squared ]}
parttwo = { $sqrt : {$multiply : [ subparttwo,subpartthree ]}}
rho = {$addFields : { rho: {$divide : [partone,parttwo]}}}
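The running-sums formula assembled above can be sanity-checked against the textbook definition of Pearson's rho: computed from count, suml, sumt, sumlsquared, sumtsquared, and sumtl, it should match covariance over the product of deviations. Toy data below, not the prescriptions set:

```python
import math

l = [51.5, 52.0, 53.1, 54.7, 55.9]   # latitudes
t = [70.0, 72.5, 75.1, 79.8, 83.0]   # per-10k quantities

# Running sums, as accumulated by the $reduce stage.
n = len(l)
suml, sumt = sum(l), sum(t)
sumlsq = sum(x * x for x in l)
sumtsq = sum(y * y for y in t)
sumtl = sum(x * y for x, y in zip(l, t))

# rho from the running sums (partone / parttwo in the pipeline above).
rho = (n * sumtl - suml * sumt) / math.sqrt(
    (n * sumlsq - suml ** 2) * (n * sumtsq - sumt ** 2))

# Direct definition: covariance over the product of deviation norms.
ml, mt = suml / n, sumt / n
direct = sum((x - ml) * (y - mt) for x, y in zip(l, t)) / math.sqrt(
    sum((x - ml) ** 2 for x in l) * sum((y - mt) ** 2 for y in t))

print(rho)  # strongly positive for this near-linear toy data
```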
Sort by rho and output
MongoDB Enterprise > db.result.find({},{_id:1,rho:1}).sort({rho:-1}).limit(5)
{ "_id" : "Audmonal_Cap 60mg", "rho" : 0.32690588090961403 }
{ "_id" : "ExoCream 500g", "rho" : 0.32119819297625635 }
{ "_id" : "Luventa XL_Cap 24mg", "rho" : 0.2593870002284348 }
{ "_id" : "Finetest Lite (Reagent)_Strips", "rho" : 0.2518374958339396 }
{ "_id" : "Campral EC_Tab 333mg", "rho" : 0.24376784724040662 }
MongoDB Enterprise > db.result.find({},{_id:1,rho:1}).sort({rho:1}).limit(5)
{ "_id" : "Ultra Lite 10cm x 4.5m M/Layer C", "rho" : -0.258189560181513 }
{ "_id" : "Triptorelin Embon_Inj 22.5mg Vl", "rho" : -0.13752453990107172 }
Show me that Gradient
Can I see it on a map
Conclusions
• The Aggregation Framework is fast.
• There is no truth to "RDBMS is just better"
• It's a good choice for non-trivial, ad-hoc queries.
• It's a good choice for large data sets
• Consider sharding and micro-sharding.
• In a cloud world, push work to the database
• even with R/SAS/Spark etc.

MongoDB Aggregation Performance

  • 17. How much did the UK Spend in 2017? select sum(cost) from prescriptions; { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } }
  • 20. The other Result select sum(cost) from prescriptions; { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } 37 Seconds 54 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0 MB/s
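The doubled `$sum` in that pipeline is easy to misread: the inner `$sum` totals the `cost` values inside one document's `prescriptions` array, while the outer `$sum` accumulates those per-document totals across the whole group. A minimal Python sketch of the two-level accumulation (the sample documents are invented for illustration):

```python
# Two-level sum, mirroring {$group: {_id: "all", t: {$sum: {$sum: "$prescriptions.cost"}}}}
docs = [
    {"practice": "A", "prescriptions": [{"cost": 2.0}, {"cost": 3.5}]},
    {"practice": "B", "prescriptions": [{"cost": 1.5}]},
]

total = 0.0
for doc in docs:
    # inner $sum: total the array within one document
    per_doc = sum(p["cost"] for p in doc["prescriptions"])
    # outer $sum: accumulate per-document totals across the group
    total += per_doc

print(total)  # 7.0
```

Note there is no `$unwind` here: the whole year's sum is computed in one pass over the documents.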
  • 21. Row Format RDBMS • Fixed length values • Known column offsets • Fast to find COLUMN. • Fast to next ROW. • Expensive to change . MongoDB • Dynamic documents • Traverse from start • Known sizes • Depth Matters • More flexibility 1 Bob 3.5 18-7-1972 NULL 2 Sally 8.9 15-3-1984 “Magic”
  • 22. Row Format RDBMS • Fixed length values • Known column offsets • Fast to find COLUMN. • Fast to next ROW. • Expensive to change . MongoDB • Dynamic documents • Traverse from start • Known sizes _id:int 1 name: str(3) “bob” size:double 3.5 when: date 18-7-1972
  • 23. Row Format RDBMS • Fixed length values • Known column offsets • Fast to find COLUMN. • Fast to next ROW. • Expensive to change . MongoDB • Dynamic documents • Traverse from start • Known sizes • Hierarchy Matters • Much more flexibility _id:int 1 name: str(3) bob sizes: array(256) [ double 3.5. double 10. double 1.2, double 99] when: date 18-7-1972
  • 24. What about an Index? select sum(cost) from prescriptions; { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } 21 Seconds (vs. 37) 54 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0 MB/s Apply Covering Index Here
  • 25. But can’t MongoDB index cover too? • Yes, but not when it’s a Multikey (array) index • We only index each unique value once • So the index cannot recreate the array
  • 26. Can we fix that? • What if we flatten the data? • Lots of redundancy • Collection is now 200% larger • Normalisation? db.prescriptions.aggregate([ {$unwind:"$prescriptions"}, {$project:{_id:0}}, {$out: "tabular"} ]) db.tabular.createIndex({'prescriptions.cost':1})
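`$unwind` emits one document per array element, copying every top-level field onto each copy, which is exactly why the flattened collection carries so much redundancy. A rough Python equivalent of the `$unwind`/`$project` step (sample data invented for illustration):

```python
def unwind(doc, path):
    """Emit one copy of doc per element of doc[path], like $unwind."""
    for elem in doc[path]:
        flat = {k: v for k, v in doc.items() if k != path}
        flat[path] = elem
        yield flat

doc = {"practice": "A", "period": 201701,
       "prescriptions": [{"cost": 2.0}, {"cost": 3.5}]}

rows = list(unwind(doc, "prescriptions"))
# 'practice' and 'period' are duplicated onto every row -- the redundancy
# that makes the flattened collection so much larger than the original.
```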
  • 27. Flat, wide data. • 110 M Rows • 51 GB as BSON • 15 GB Compressed • Not really tabular. • 860 MB Index • Prefix Compression • Super space efficient
  • 28. Query Performance when flattened. MongoDB No covering Index MongoDB With covering Index 509 Seconds (vs 54) 509 Seconds 6% CPU 1700IOPS 30MB/s 6% CPU 1700IOPS 30MB/s
  • 29. Query Performance when flattened. • That doesn’t look right. MongoDB No Index Index 509 Seconds (vs 54) 509 Seconds 6% CPU 1700IOPS 30MB/s 6% CPU 1700IOPS 30MB/s "queryPlanner" : { "winningPlan" : { "stage" : "COLLSCAN”, "direction" : "forward” }}
  • 30. Flat, wide data. • Need to persuade aggregation to use the index • Add a query (cost > 0) or sort by cost at the start • Still slower than document model? • Document model is efficient. • This data is actually MOST of the database (110M entries) • Imagine if our index was a small percentage of the data. • Index compression has a cost when reading. No Index Index 509 Seconds (vs 54) 177 Seconds (vs 54) 6% CPU 1700IOPS 30MB/s 25% CPU 0 IOPS 0MB/s
  • 31. Table Layout RDBMS • Lots of fixed size rows in a file • Nice predictable layout MongoDB • Variable Length rows in a file • Less predictable layout
  • 32. Table Layout – The Truth • RDBMS and MongoDB both store records in Trees • Records are in some ways, just like indexes.
  • 33. Table Layout – The Truth RDBMS • Rows held in Balanced Tree • This IS the Primary Key • Linked leaves MongoDB • Docs in Balanced Tree • Index on identity • Can only walk the tree • Slower to collection scan • Less lock contention
  • 34. Table Layout – The Truth RDBMS • Rows held in Balanced Tree • This IS the Primary Key • Linked leaves MongoDB • Docs in Balanced Tree • Organised by Identity (int64) • No links between leaves • Slower to scan everything • Much less lock contention
  • 35. In-Document rollup. • We have multiple data items in each document. • Add summaries of cost in each document? • No cost when updating anyway $max,$min,$sum,$count. • RDBMS equivalent has big cost. • You need to know in advance, or add as needed • Like an RDBMS index • What if we index the in-document rollup?
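The rollup idea is to precompute a per-document summary at write time, so the report only reads one scalar per document instead of traversing the array. A sketch of the pattern, assuming a hypothetical `totalcost` field (the field name is my choice, not from the deck):

```python
# Maintain a rollup field whenever the document is written or updated...
def add_rollup(doc):
    doc["totalcost"] = sum(p["cost"] for p in doc["prescriptions"])
    return doc

docs = [add_rollup(d) for d in [
    {"prescriptions": [{"cost": 2.0}, {"cost": 3.5}]},
    {"prescriptions": [{"cost": 1.5}]},
]]

# ...and the report becomes a cheap sum of one scalar per document,
# roughly equivalent to {$group: {_id: "all", t: {$sum: "$totalcost"}}}
grand_total = sum(d["totalcost"] for d in docs)
```

Indexing that scalar field is what turns the 18-second scan on the next slides into a near-instant index read.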
  • 36. MongoDB with in document roll-up. No Index on IDI Index on IDI 18 Seconds (Versus 54, or 21 in RDBMS) 18 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0MB/s
  • 37. MongoDB with in document roll-up. No Index on IDI Index on IDI 18 Seconds (Versus 54, or 21 in RDBMS) 0.01 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0MB/s
  • 38. So far… • When Data fits in RAM • RDBMS Table scan faster than Mongo Collection scan • RDBMS Index scan faster than RDBMS Table scan • Large MongoDB Index scan isn’t the solution • In document rollups beat RDBMS Index scan • Index scan of in-document rollups is really quick.
  • 40. What if it wasn’t all about the CPU? • Data Lakes and “Big Data” • Limited by reading data from disk • Limited by Parallelism • New Experiment Time. • Reduce RAM to much less than Data Size* • Work with disk bound data. • Still one CPU.
  • 41. Table/Collection scan from Disk select sum(cost) from prescriptions; { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } 157 Seconds 61 Seconds 3.5% CPU 1253 IOPS 103 MB/s 25% CPU 650 IOPS 103 MB/s
  • 42. Why is MongoDB faster than MySQL? MySQL • Data Size = 15 GB MongoDB • Data Size = 5GB* • Minimal decompression overhead *Not ‘Big’ Data
  • 43. Index scan from Disk select sum(cost) from prescriptions; { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } 41 Seconds 61 Seconds 25% CPU 1020 IOPS 103 MB/s 25% CPU 650 IOPS 103 MB/s Add an Index
  • 45. More Complex Queries From RAM • May still use Disk for temp tables, Storage etc. • All Tables and Indexes fit in RAM From DISK Data does NOT fit in RAM Some indexes MAY be in RAM No indexes used for MongoDB
  • 46. With Group BY (RAM) select sum(cost) from prescriptions group by period; { $group : { _id: "$period", t : {$sum : { $sum: "$prescriptions.cost"} }}} 63 Seconds 60 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0 MB/s 30 Seconds (Index) 25% CPU 0 IOPS 0 MB/s
  • 47. With Group BY (Disk) select sum(cost) from prescriptions group by period; { $group : { _id: "$period", t : {$sum : { $sum: "$prescriptions.cost"} }}} 39 Seconds (index) 18% CPU 1010 IOPS 103 MB/s 160 Seconds 63 Seconds 10% CPU 1020 IOPS 103 MB/s 24% CPU 650 IOPS 82 MB/s
  • 48. Top 10 practices by spend (RAM) SELECT practice, SUM(cost) AS totalspend FROM prescriptions GROUP BY practice ORDER BY totalspend DESC LIMIT 10; [ { $group: { _id: "$practice", spend: { $sum: { $sum: "$prescriptions.cost"}}}}, { $sort: { spend: -1}}, { $limit: 10} ] 51 Seconds (index) 18% CPU 1010 IOPS 103 MB/s 75 Seconds 63 Seconds 10% CPU 1020 IOPS 103 MB/s 24% CPU 650 IOPS 82 MB/s
  • 49. Top 10 practices by spend (Disk) SELECT practice, SUM(cost) AS totalspend FROM prescriptions GROUP BY practice ORDER BY totalspend DESC LIMIT 10; [ { $group: { _id: "$practice", spend: { $sum: { $sum: "$prescriptions.cost"}}}}, { $sort: { spend: -1}}, { $limit: 10} ] 160 Seconds 64 Seconds 10% CPU 1150 IOPS 104 MB/s 25% CPU 650 IOPS 82 MB/s 55 Seconds (indexed) 21% CPU 724 IOPS 77 MB/s
  • 50. £ per patient – JOIN and Group (RAM) SELECT practice, SUM(cost / numpatients) AS totalspend, AVG(numpatients) FROM nhs.prescriptions pr, nhs.patientcounts pc WHERE pr.practice = pc.code GROUP BY practice ORDER BY totalspend DESC LIMIT 10; { "$group" : { "_id":"$practice", "perpatient": {"$sum": {"$divide": [{"$sum":"$prescriptions.cost"}, "$numpatients"] } }}}, { "$sort": {"perpatient":-1}}, { "$limit": 10} 105 Seconds 62 Seconds 25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0 MB/s
  • 51. £ per patient – JOIN and Group (Disk) SELECT practice, SUM(cost / numpatients) AS totalspend, AVG(numpatients) FROM nhs.prescriptions pr, nhs.patientcounts pc WHERE pr.practice = pc.code GROUP BY practice ORDER BY totalspend DESC LIMIT 10; { "$group" : { "_id":"$practice", "perpatient": {"$sum": {"$divide": [{"$sum":"$prescriptions.cost"}, "$numpatients"] } }}}, { "$sort": {"perpatient":-1}}, { "$limit": 10} 160 Seconds 62 Seconds 17% CPU 1200 IOPS 103 MB/s 24% CPU 650 IOPS 82 MB/s
  • 52. £/patient/county – nested SELECT (RAM) select county,sum(totalcost) as spend,sum(patients) as patients,sum(totalcost)/sum(patients) as costperpatient from (select county,sum(cost) as totalcost, avg(numpatients) as patients from prescriptions pr,patientcounts pc,practices pa where pr.practice=pc.code and pr.practice=pa.code and pa.period=pr.period group by county,practice) as byprac group by county having patients > 100000 order by costperpatient desc limit 20; db.prescriptions.aggregate([ {"$group" : {"_id" : { "county": "$address.county", "practice": "$practice"}, "spend" : { "$sum" : {"$sum" : "$prescriptions.cost"}}, "numpatients" : { "$avg" : "$numpatients"}}}, { "$group": { "_id" : "$_id.county", "spend" : { "$sum" : "$spend" }, "patients" : {"$sum": "$numpatients"}}}, {"$addFields" : { "costperpatient" : { "$divide" : ["$spend","$patients"] }}}, {"$match" : { "patients" : { "$gt" : 100000}}}, {"$sort" : { "costperpatient" : -1}}, {"$limit":20} ]) 160 Seconds 66 Seconds 24% CPU 0 IOPS 0 MB/s 24% CPU 0 IOPS 0 MB/s
  • 53. Result Spend (£M) Patients Per Patient(£) LINCOLNSHIRE 122 699309 175 WIRRAL 25 149554 172 CO DURHAM 102 596638 171 CLEVELAND 75 462593 163 ISLE OF WIGHT 25 144555 162
  • 54. Result Spend (£M) Patients Per Patient(£) LINCOLNSHIRE 122 699309 175 WIRRAL 25 149554 172 CO DURHAM 102 596638 171 CLEVELAND 75 462593 163 ISLE OF WIGHT 25 144555 162 Spend (£M) Patients Per Patient(£) BERKSHIRE 102 944538 108 MIDDLESEX 150 1469189 102 BRISTOL 14 145660 97 LONDON 522 5672564 92 LEEDS 9 122785 74
  • 55. £/patient/county – nested SELECT (Disk) select county,sum(totalcost) as spend,sum(patients) as patients,sum(totalcost)/sum(patients) as costperpatient from (select county,sum(cost) as totalcost, avg(numpatients) as patients from prescriptions pr,patientcounts pc,practices pa where pr.practice=pc.code and pr.practice=pa.code and pa.period=pr.period group by county,practice) as byprac group by county having patients > 100000 order by costperpatient desc limit 20; db.prescriptions.aggregate([ {"$group" : {"_id" : { "county": "$address.county", "practice": "$practice"}, "spend" : { "$sum" : {"$sum" : "$prescriptions.cost"}}, "numpatients" : { "$avg" : "$numpatients"}}}, { "$group": { "_id" : "$_id.county", "spend" : { "$sum" : "$spend" }, "patients" : {"$sum": "$numpatients"}}}, {"$addFields" : { "costperpatient" : { "$divide" : ["$spend","$patients"] }}}, {"$match" : { "patients" : { "$gt" : 100000}}}, {"$sort" : { "costperpatient" : -1}}, {"$limit":20} ]) 220 Seconds 67 Seconds 21% CPU 700 IOPS 68 MB/s 23% CPU 640 IOPS 82 MB/s
  • 56. Most common drugs – REGROUP (RAM) select bnfcode,max(name), sum(nitems) as items from nhs.prescriptions group by bnfcode order by items desc limit 10; [ {"$unwind":"$prescriptions"}, {"$group" : { _id: "$prescriptions.bnfcode", name: {$max: "$prescriptions.name"}, items: {$sum: "$prescriptions.nitems"}}}, {"$sort" : { "items" : -1}}, {"$limit":10}] 300 Seconds 262 Seconds 23% CPU 0 IOPS 0MB/s 25% CPU 0 IOPS 0MB/s 126 Seconds (Indexed) 25% CPU 0 IOPS 0MB/S
  • 57. Grouping Techniques SQL Can take advantage of index ordering by group key, all items with same key come together so can process one at a time. 1,1,1,1,1,2,2,2,2,3,3 Uses a temp table and sort when it can’t. MongoDB Does not take advantage of ordering (yet) – maintains a data structure with all values. Assumed you will want to group further down the pipeline so optimised for that. Builds a tree (using disk) for the values.
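The two grouping strategies on that slide can be sketched directly: with input arriving ordered by the group key (as an index scan delivers it), one running total at a time suffices; without ordering, every group's total must be held at once. An illustrative Python contrast (the data is invented):

```python
from itertools import groupby
from collections import defaultdict

rows = [(1, 10), (1, 5), (2, 7), (2, 1), (3, 4)]  # (key, value), sorted by key

# SQL-style streaming group: equal keys arrive together, so each group
# is finished (and can be emitted) before the next one starts.
streamed = {k: sum(v for _, v in grp) for k, grp in groupby(rows, key=lambda r: r[0])}

# MongoDB-style group: ordering is not assumed, so a structure holding
# every key's running total is kept until the input is exhausted.
hashed = defaultdict(int)
for k, v in rows:
    hashed[k] += v

print(streamed == dict(hashed))  # same answer, very different memory profile
```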
  • 58. Result Omeprazole_Cap E/C 20mg Acid Reflux 23,292,184 Aspirin Disper_Tab 75mg 15,361,735 Paracet_Tab 500mg 14,562,514 Amlodipine_Tab 5mg Blood Pressure 14,416,079 Atorvastatin_Tab 20mg Cholesterol 13,152,079 Lansoprazole_Cap 30mg (E/C Gran) Acid Reflux 12,906,650 Simvastatin_Tab 40mg Cholesterol 12,760,343 Metformin HCl_Tab 500mg Diabetes 11,404,331 Salbutamol_Inha 100mcg (200 D) C Asthma 10,595,100 Levothyrox Sod_Tab 100mcg Thyroid 9,312,464
  • 59. Most common drugs – REGROUP (Disk) select bnfcode,max(name), sum(nitems) as items from nhs.prescriptions group by bnfcode order by items desc limit 10; [ {"$unwind":"$prescriptions"}, {"$group" : { _id: "$prescriptions.bnfcode", name: {$max: "$prescriptions.name"}, items: {$sum: "$prescriptions.nitems"}}}, {"$sort" : { "items" : -1}}, {"$limit":10}] 1427 Seconds 262 Seconds 4% CPU 1800 IOPS 100 MB/S 24% CPU 180 IOPS 23 MB/s 192 Seconds (Index) 13% CPU 520 IOPS 56MB/s
  • 60. Most Expensive Drugs - Result Rivaroxaban_Tab 20mg Anti Coagulent £100,007,025 Apixaban_Tab 5mg Anti Coagulent £79,302,385 Fostair_Inh 100mcg/6mcg (120D) C Asthma £75,541,726 Tiotropium_Pdr For Inh Cap 18mcg COPD £66,348,167 Sitagliptin_Tab 100mg Diabetes £60,919,725 Symbicort_Turbohaler 200mcg/6mcg Asthma £44,314,887 Apixaban_Tab 2.5mg Anti Coagulent £41,290,937 Ins Lantus SoloStar_100u/ml 3ml Diabetes £41,182,602 Ezetimibe_Tab 10mg Cholesterol £40,756,234 Linagliptin_Tab 5mg Diabetes £38,503,893
  • 61. Anomaly Detection – JOIN Derived (RAM) SELECT prescriptions.bnfcode,MAX(prescriptions.name), prescriptions.practice,MAX(practices.name), AVG(nitems),AVG(patientcounts.numpatients), AVG(aveperperson),AVG((nitems / patientcounts.numpatients) / aveperperson) AS ratio FROM prescriptions LEFT JOIN (SELECT bnfcode, AVG(nitems / numpatients) AS aveperperson FROM prescriptions, patientcounts WHERE prescriptions.practice = patientcounts.code GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode LEFT JOIN patientcounts ON prescriptions.practice = patientcounts.code LEFT JOIN practices ON practices.code = prescriptions.practice WHERE patientcounts.numpatients > 500 AND aveperperson > 0 AND prescriptions.practice NOT IN ('Y01924') GROUP BY prescriptions.practice,prescriptions.bnfcode ORDER BY ratio DESC LIMIT 10; db.prescriptions.aggregate([{"$unwind":"$prescriptions"},{"$group":{"_id ":"$prescriptions.bnfcode","aveperperson":{"$avg":{"$divide":["$prescrip tions.nitems","$numpatients"]}}}},{"$match":{"aveperperson":{"$ne":null} }},{"$out":"typical"}]) db.prescriptions.aggregate([{"$match":{"numpatients":{"$gt":500},"actice ":{"$nin":["Y01924"]}}},{"$unwind":"$prescriptions"},{"$group":{"_id":{" bnfcode":"$prescriptions.bnfcode","practice":"$practice"},"name":{"$max" :"$prescriptions.name"},"pracicename":{"$max":"$address.name"},"nitems": {"$sum":"$prescriptions.nitems"},"numpatients":{"$max":"$numpatients"}," nmonths":{"$sum":1}}},{"$addFields":{"perpatient":{"$divide":[{"$divide" :["$nitems","$nmonths"]},"$numpatients"]}}},{"$lookup":{"from":"typical" ,"localField":"_id.bnfcode","foreignField":"_id","as":"typical"}},{"$unw ind":"$typical"},{"$addFields":{"ratio":{"$divide":["$perpatient","$typi cal.aveperperson"]}}},{"$sort":{"ratio":-1}},{"$limit":10}])
  • 62. Anomaly Detection – JOIN Derived (RAM) SELECT prescriptions.bnfcode,MAX(prescriptions.name), prescriptions.practice,MAX(practices.name), AVG(nitems),AVG(patientcounts.numpatients), AVG(aveperperson),AVG((nitems / patientcounts.numpatients) / aveperperson) AS ratio FROM prescriptions LEFT JOIN (SELECT bnfcode, AVG(nitems / numpatients) AS aveperperson FROM prescriptions, patientcounts WHERE prescriptions.practice = patientcounts.code GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode LEFT JOIN patientcounts ON prescriptions.practice = patientcounts.code LEFT JOIN practices ON practices.code = prescriptions.practice WHERE patientcounts.numpatients > 500 AND aveperperson > 0 AND prescriptions.practice NOT IN ('Y01924') GROUP BY prescriptions.practice,prescriptions.bnfcode ORDER BY ratio DESC LIMIT 10; db.prescriptions.aggregate([{"$unwind":"$prescriptions"},{"$group":{"_id ":"$prescriptions.bnfcode","aveperperson":{"$avg":{"$divide":["$prescrip tions.nitems","$numpatients"]}}}},{"$match":{"aveperperson":{"$ne":null} }},{"$out":"typical"}]) db.prescriptions.aggregate([{"$match":{"numpatients":{"$gt":500},"actice ":{"$nin":["Y01924"]}}},{"$unwind":"$prescriptions"},{"$group":{"_id":{" bnfcode":"$prescriptions.bnfcode","practice":"$practice"},"name":{"$max" :"$prescriptions.name"},"pracicename":{"$max":"$address.name"},"nitems": {"$sum":"$prescriptions.nitems"},"numpatients":{"$max":"$numpatients"}," nmonths":{"$sum":1}}},{"$addFields":{"perpatient":{"$divide":[{"$divide" :["$nitems","$nmonths"]},"$numpatients"]}}},{"$lookup":{"from":"typical" ,"localField":"_id.bnfcode","foreignField":"_id","as":"typical"}},{"$unw ind":"$typical"},{"$addFields":{"ratio":{"$divide":["$perpatient","$typi cal.aveperperson"]}}},{"$sort":{"ratio":-1}},{"$limit":10}]) 2250 Seconds 1489 Seconds 24% CPU 1020 IOPS 90MB/s 25% CPU 0 IOPS 0MB/s
  • 63. Results DRUG PRACTICE RATIO Methadone FULCRUM 297 Rehab Trazodone CARE HOMES MEDICAL 242 Elderly Care Buprenorphine FULCRUM 233 Thickenup PDR CARE HOMES MEDICAL 174 Vitrex Nitrile Gloves REETH MEDICAL 173 Preference? Ema Film Gloves NEW SPrintwells1 168 Pro D3 Cap PALFREY HEALTH CENTRE 123 Vitamin D? Fultium D3 Cap MOHANTY 123 Vitamin D
  • 64. Results DRUG PRACTICE RATIO Oxycodone Merseyside GP 108 Tamiflu Lancashire GP 56 Nicotine_Transdermal Dorset GP 33 Loperamide (Immodium) Kent GP 16
  • 65. Anomaly Detection – JOIN Derived (Disk) SELECT prescriptions.bnfcode,MAX(prescriptions.name), prescriptions.practice,MAX(practices.name), AVG(nitems),AVG(patientcounts.numpatients), AVG(aveperperson),AVG((nitems / patientcounts.numpatients) / aveperperson) AS ratio FROM prescriptions LEFT JOIN (SELECT bnfcode, AVG(nitems / numpatients) AS aveperperson FROM prescriptions, patientcounts WHERE prescriptions.practice = patientcounts.code GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode LEFT JOIN patientcounts ON prescriptions.practice = patientcounts.code LEFT JOIN practices ON practices.code = prescriptions.practice WHERE patientcounts.numpatients > 500 AND aveperperson > 0 AND prescriptions.practice NOT IN ('Y01924') GROUP BY prescriptions.practice,prescriptions.bnfcode ORDER BY ratio DESC LIMIT 10; db.prescriptions.aggregate([{"$unwind":"$prescriptions"},{"$group":{"_id ":"$prescriptions.bnfcode","aveperperson":{"$avg":{"$divide":["$prescrip tions.nitems","$numpatients"]}}}},{"$match":{"aveperperson":{"$ne":null} }},{"$out":"typical"}]) db.prescriptions.aggregate([{"$match":{"numpatients":{"$gt":500},"actice ":{"$nin":["Y01924"]}}},{"$unwind":"$prescriptions"},{"$group":{"_id":{" bnfcode":"$prescriptions.bnfcode","practice":"$practice"},"name":{"$max" :"$prescriptions.name"},"pracicename":{"$max":"$address.name"},"nitems": {"$sum":"$prescriptions.nitems"},"numpatients":{"$max":"$numpatients"}," nmonths":{"$sum":1}}},{"$addFields":{"perpatient":{"$divide":[{"$divide" :["$nitems","$nmonths"]},"$numpatients"]}}},{"$lookup":{"from":"typical" ,"localField":"_id.bnfcode","foreignField":"_id","as":"typical"}},{"$unw ind":"$typical"},{"$addFields":{"ratio":{"$divide":["$perpatient","$typi cal.aveperperson"]}}},{"$sort":{"ratio":-1}},{"$limit":10}]) 7848 Seconds 1655 Seconds 24% CPU 1020 IOPS 90MB/s 25% CPU 170 IOPS 24MB/s
  • 66. Conclusions • MongoDB is faster from disk when there are no indexes • MongoDB is generally faster for more complex queries • MongoDB fits the data-lake model nicely.
  • 67. Is there a SQL?
  • 68. BI Connector • Translating Proxy Server • SQL -> MySQL Wire Protocol -> Bi Connector -> MongoDB Aggregation
  • 69. BI Connector • Total Spend. • Sum one column SQL Unindexed SQL Indexed MongoDB Aggregation BI Connector Disk 157 41 61 RAM 37 21 54 150
  • 70. Why slower? { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } {"$unwind":{"includeArrayIndex":"prescriptions_idx","path":"$prescriptions"}}, {"$group":{"_id":{},"sum(nhs_DOT_prescriptions_DOT_cost)":{"$sum":"$cost"},"sum(nhs_DOT_prescriptions_DOT_cost)_count":{"$sum":{"$cond":[{"$eq":[{"$ifNull":["$cost",null]},null]},0,1]}}}}, {"$project":{"_id":0,"sum(nhs_DOT_prescriptions_DOT_cost)":{"$let":{"in":{"$cond":[{"$or":[{"$eq":[{"$ifNull":["$$expr",null]},null]},{"$eq":["$$expr",0]},{"$eq":["$$expr",false]}]},{"$literal":null},"$sum(nhs_DOT_prescriptions_DOT_cost)"]},"vars":{"expr":"$sum(nhs_DOT_prescriptions_DOT_cost)_count"}}}}}, {"$project":{"nhs_DOT_sum(nhs_DOT_prescriptions_DOT_cost)":"$sum(nhs_DOT_prescriptions_DOT_cost)","_id":0}}
  • 71. Why so much slower – Unwind { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } {"$unwind":{"includeArrayIndex":"prescriptions_idx","path":"$prescriptions"}}, {"$group":{"_id":{},"sum(nhs_DOT_prescriptions_DOT_cost)":{"$sum":"$cost"},"sum(nhs_DOT_prescriptions_DOT_cost)_count":{"$sum":{"$cond":[{"$eq":[{"$ifNull":["$cost",null]},null]},0,1]}}}}, {"$project":{"_id":0,"sum(nhs_DOT_prescriptions_DOT_cost)":{"$let":{"in":{"$cond":[{"$or":[{"$eq":[{"$ifNull":["$$expr",null]},null]},{"$eq":["$$expr",0]},{"$eq":["$$expr",false]}]},{"$literal":null},"$sum(nhs_DOT_prescriptions_DOT_cost)"]},"vars":{"expr":"$sum(nhs_DOT_prescriptions_DOT_cost)_count"}}}}}, {"$project":{"nhs_DOT_sum(nhs_DOT_prescriptions_DOT_cost)":"$sum(nhs_DOT_prescriptions_DOT_cost)","_id":0}}
  • 72. Why so much slower – SQL Standard semantics { $group : { _id: "all" , t : { $sum : { $sum: "$prescriptions.cost" } } } } $sum : [ NULL, NULL, NULL ] = 0 SUM [NULL,NULL,NULL] = NULL {"$unwind":{"includeArrayIndex":"prescriptions_idx","path":"$prescriptions"}}, {"$group":{"_id":{},"sum(nhs_DOT_prescriptions_DOT_cost)":{"$sum":"$cost"},"sum(nhs_DOT_prescriptions_DOT_cost)_count":{"$sum":{"$cond":[{"$eq":[{"$ifNull":["$cost",null]},null]},0,1]}}}}, {"$project":{"_id":0,"sum(nhs_DOT_prescriptions_DOT_cost)":{"$let":{"in":{"$cond":[{"$or":[{"$eq":[{"$ifNull":["$$expr",null]},null]},{"$eq":["$$expr",0]},{"$eq":["$$expr",false]}]},{"$literal":null},"$sum(nhs_DOT_prescriptions_DOT_cost)"]},"vars":{"expr":"$sum(nhs_DOT_prescriptions_DOT_cost)_count"}}}}}, {"$project":{"nhs_DOT_sum(nhs_DOT_prescriptions_DOT_cost)":"$sum(nhs_DOT_prescriptions_DOT_cost)","_id":0}}
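All that `$cond`/`$ifNull` machinery exists because the two systems disagree about missing values: SQL's `SUM` over nothing but NULLs is NULL, while `$sum` skips non-numeric values and yields 0, so the generated pipeline must also count non-null inputs to reproduce SQL's answer. The two behaviours, sketched in Python:

```python
values = [None, None, None]

# $sum semantics: non-numeric values are ignored, result is 0
mongo_sum = sum(v for v in values if isinstance(v, (int, float)))

# SQL SUM semantics: NULLs are skipped, but if nothing remains the result is NULL
non_null = [v for v in values if v is not None]
sql_sum = sum(non_null) if non_null else None
```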
  • 73. BI Connector • Total Spend By Period. • Sum one column Group by Primary Key SQL Unindexed SQL Indexed MongoDB Aggregation BI Connector Disk 160 39 63 RAM 63 30 60 187
  • 74. BI Connector • Total Spend By PRACTICE. • Sum one column Group by PART OF KEY SQL Unindexed SQL Indexed MongoDB Aggregation BI Connector Disk 160 55 62 RAM 75 51 62 230
  • 75. BI Connector • Total Spend By PATIENT. • Sum one column Group BY JOINED FIELD SQL Indexed MongoDB Aggregation BI Connector Disk 160 62 RAM 105 62 230
  • 76. BI Connector • AVG Spend By SPEND PER PATIENT PER COUNTY. • Sum one column Group BY COMPUTED FIELD SQL Indexed MongoDB Aggregation BI Connector Disk 220 67 RAM 160 66 220
  • 77. BI Connector • MOST prescribed drugs. • Sum one column Group BY Not PK SQL Unindexed SQL Indexed MongoDB Aggregation BI Connector Disk 1427 192 262 RAM 300 126 262 280
  • 78. BI Connector • Anomaly Detection. • Sum one column Group from subquery Joined table SQL Indexed MongoDB Aggregation BI Connector Disk 7848 1655 RAM 2250 1489 D.N.F !
  • 79. BI Connector – Did Not Finish – why? • Anomaly Detection. • Did run – but was taking a long time • Using expressive $lookup for the join $lookup: { from: "prescriptions", let: { drug : "aspirin" }, pipeline: [ group total by drugname,practice, divide by practice size, group by drugname, match $$drug ], as: <output array field> }
  • 80. BI Connector – why Did Not Finish • Anomaly Detection. • Did run – but was taking a long time • Using expressive $lookup for the join • No Index on in-memory table • Hand written version • used temp collection • Made use of _id as the lookup field $lookup: { from: "prescriptions", let: { drug : "aspirin" }, pipeline: [ group total by drugname,practice, divide by practice size, group by drugname, match $$drug ], as: <output array field> }
  • 81. Conclusions • BI Connector a little slower than RDBMS for simple queries • The more complicated the query, the faster it is relatively. • It’s not as quick as hand crafted Aggregation • But you can put that in views. • But it’s very convenient • You can use your existing BI Tooling • And you could always use Charts instead.
  • 82. Is there an Elephant in the room?
  • 83. External compute engines • Spark • Hadoop • R • Python • Java / C
  • 84. Pros • Simpler to write much more complicated processing. • Lots of libraries of pre written code • Able to perform a lot of in-memory computation • MongoDB can send them data very, very quickly
  • 85. Cons • Costs of transferring from or inside clouds • Atlas • AWS • Network speed limitations. • Additional hardware. • Security considerations. AWS Same Region 1 cent / GB AWS Between Regions 9 cents / GB AWS Out to 11 cents / GB
  • 86. So do I use Spa^HR^doop! Or not? • Yes – those tools are great for many things • But always push computation DOWN to MongoDB if you can • There is a balance • Amount of effort to write as a Pipeline • Reduced network costs in time and money
  • 87. Simple Example • Pearson’s rho • Degree of correlation between two numeric lists • Let’s compare latitude (North vs South) • And quantity of drug per person • Hypothesis: “For some drugs, more is prescribed as you travel up or down the UK”
  • 88. Step 1 - Geocoding • We need to augment our records with Lat/Long • Download a handy set of postcode centroids • mongoimport --type csv --headerline -d nhs postcodes.csv • Use $lookup and $out to attach to each record and make new collection.
  • 89. Simple Geocoding tidypc = {$addFields: { "address.postcode" : {$rtrim:{input:"$address.postcode"}}}} Geocode = { $lookup :{ from: "postcodes", localField: "address.postcode", foreignField: "Postcode", as: "location" }} firstelement = { $addFields : { location : { $arrayElemAt : [ "$location" , 0 ]} }} choosefields = { $project : { practice:1, numpatients:1, lon: "$location.Longitude", lat: "$location.Latitude","prescriptions.name":1,"prescriptions.nitems":1}}
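The `$rtrim`/`$lookup`/`$arrayElemAt` sequence above is a left outer join that attaches at most one postcode centroid to each practice. Roughly the same thing in plain Python (the postcode coordinates are made up for illustration):

```python
# Centroid table, as loaded by mongoimport from postcodes.csv (values invented)
postcodes = {"LS1 4AB": (-1.55, 53.80), "SW1A 1AA": (-0.14, 51.50)}

practice = {"practice": "A", "address": {"postcode": "LS1 4AB "}}

# $rtrim the postcode, $lookup it, then take the first match ($arrayElemAt 0)
pc = practice["address"]["postcode"].rstrip()
lon, lat = postcodes.get(pc, (None, None))
practice["lon"], practice["lat"] = lon, lat
```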
  • 90. Step 2 – Group by drug • For each drug compute average quantity/10,000 patients per surgery • Group to one record per drug with an array of objects { drug : “Aspirin”, prescribed : [ { where : [ -3.5, 55.2], per10k : 75.4 } … ] }
  • 91. Group by BNF code unwind ={ $unwind:"$prescriptions"} regroup = { $group : { _id : "$prescriptions.name", p : { $push : { where : ["$lon","$lat"] , per10k : { $multiply: [10000, {$divide : [ "$prescriptions.nitems", "$numpatients"]}]}}}}}
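The `$unwind`-then-`$group`-with-`$push` pattern above rebuilds one record per drug holding an array of per-surgery observations. Sketched in Python with invented sample rows (field names follow the slide):

```python
from collections import defaultdict

# one flattened row per (surgery, drug), as produced by $unwind
rows = [
    {"name": "Aspirin", "lon": -3.5, "lat": 55.2, "nitems": 80, "numpatients": 10000},
    {"name": "Aspirin", "lon": -0.1, "lat": 51.5, "nitems": 30, "numpatients": 10000},
]

# $group with $push: one record per drug, an array of {where, per10k} objects
by_drug = defaultdict(list)
for r in rows:
    by_drug[r["name"]].append({
        "where": [r["lon"], r["lat"]],
        "per10k": 10000 * r["nitems"] / r["numpatients"],
    })
```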
  • 92. Step 3 – Pearson’s Rho • Compute rho on the array, comparing per10k to latitude.
  • 93. Computing Pearsons RHO sumcolumns = {$addFields : { s : { $reduce : { input : "$p", initialValue: {count:0,suml:0,sumt:0,sumlsquared:0,sumtsquared:0,sumtl:0}, in : { count : { $add : [ "$$value.count", 1]}, suml : { $add : [ "$$value.suml", "$$this.l" ]}, sumt : { $add : [ "$$value.sumt", "$$this.t" ]}, sumlsquared : { $add : [ "$$value.sumlsquared", { $multiply : ["$$this.l" ,"$$this.l"]}]}, sumtsquared : { $add : [ "$$value.sumtsquared", { $multiply : ["$$this.t" ,"$$this.t"]}]}, sumtl : { $add : [ "$$value.sumtl", { $multiply : ["$$this.t" ,"$$this.l"]}]}, } }}}}
  • 94. Computing Pearsons RHO multiply_suml_sumt = { $multiply : [ "$s.suml","$s.sumt"] } multiply_sumtl_count = { $multiply : ["$s.sumtl","$s.count"]} partone = { $subtract : [ multiply_sumtl_count, multiply_suml_sumt ]} multiply_sumlsquared_count = { $multiply : ["$s.sumlsquared","$s.count"]} suml_squared = { $multiply : ["$s.suml","$s.suml"]} subparttwo = { $subtract : [ multiply_sumlsquared_count,suml_squared ]} multiply_sumtsquared_count = { $multiply : ["$s.sumtsquared","$s.count"]} sumt_squared = { $multiply : ["$s.sumt","$s.sumt"]} subpartthree = { $subtract : [ multiply_sumtsquared_count,sumt_squared ]} parttwo = { $sqrt : {$multiply : [ subparttwo,subpartthree ]}} rho = {$addFields : { rho: {$divide : [partone,parttwo]}}}
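The `$reduce`/`$addFields` stages implement the textbook running-sums form of Pearson's rho. The same arithmetic in Python, checked against a perfectly correlated series (a sketch of the formula, not the deck's actual data):

```python
from math import sqrt

def pearson_rho(pairs):
    """pairs: list of (t, l) observations, e.g. (per10k, latitude)."""
    n = suml = sumt = sumlsquared = sumtsquared = sumtl = 0
    for t, l in pairs:                       # the $reduce accumulation
        n += 1
        suml += l
        sumt += t
        sumlsquared += l * l
        sumtsquared += t * t
        sumtl += t * l
    # same numerator (partone) and denominator (parttwo) as the stages above
    partone = n * sumtl - suml * sumt
    parttwo = sqrt((n * sumlsquared - suml ** 2) * (n * sumtsquared - sumt ** 2))
    return partone / parttwo

rho = pearson_rho([(1, 2), (2, 4), (3, 6)])  # exactly linear, so rho == 1
```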
  • 95. Sort by rho and output MongoDB Enterprise > db.result.find({},{_id:1,rho:1}).sort({rho:-1}).limit(5) { "_id" : "Audmonal_Cap 60mg", "rho" : 0.32690588090961403 } { "_id" : "ExoCream 500g", "rho" : 0.32119819297625635 } { "_id" : "Luventa XL_Cap 24mg", "rho" : 0.2593870002284348 } { "_id" : "Finetest Lite (Reagent)_Strips", "rho" : 0.2518374958339396 } { "_id" : "Campral EC_Tab 333mg", "rho" : 0.24376784724040662 } MongoDB Enterprise > db.result.find({},{_id:1,rho:1}).sort({rho:1}).limit(5) { "_id" : "Ultra Lite 10cm x 4.5m M/Layer C", "rho" : -0.258189560181513 } { "_id" : "Triptorelin Embon_Inj 22.5mg Vl", "rho" : -0.13752453990107172 }
  • 96. Show me that Gradient
  • 97. Show me that Gradient
  • 98. Can I see it on a map
  • 99. Can I see it on a map
  • 100. Conclusions • The Aggregation Framework is fast. • There is no truth to “RDBMS is just better” • It’s a good choice for non-trivial, ad-hoc queries. • It’s a good choice for large data sets • Consider sharding and microsharding. • In a Cloud world – push work to the database • Even with R/SAS/Spark! Etc.

Editor's Notes

  1. Intro – Shard N – Tech heavy – understandable by all. Not my usual talk – 40% too easy, 40% too hard – this is for the 20% in the middle
  2. Clarify – what if the Live data is in MongoDB and you want to report In place? Copy to RDBMS?
  6. Imagine – all things being equal. You want to build a Data warehouse, or a Data Lake or – you don’t really Know just report on the data you have. How does MongoDB compare to a more traditional approach.
  7. Although you do have to differentiae between OLAP and Warehouseing etc.
  9. For now we will test with 1 Year.
  10. Talk about why not Postgres
  11. Note NO unwinding
  31. PIRATE JOKE