Find out which is faster, SQL or NoSQL, for traditional reporting tasks. Discover how you can optimise MongoDB aggregation pipelines and how to push complex computation down to the database.
2. The Aggregation Framework
• What is it?
• When should I use it?
• What can it do and not do?
• When should I use it instead of an RDBMS?
3. What is the Aggregation Framework
• It’s a data transformation pipeline.
• It’s ultimately a Turing-complete functional language.
• It’s SELECT, AS, JOIN, GROUP BY, and HAVING.
• It’s a fun challenge to use.
4. What can it do, and not do?
• It can read and examine documents, apply logic to them, and create new ones.
• Technically – it can do almost anything.
• Mine Bitcoins.
• Learn (in the AI sense).
• Emulate / Transpile SQL statements.
• Generate graphics.
• Run simulations.
• It can’t currently edit existing data in place.
5. When should I definitely use it?
• When the data’s in MongoDB and you don’t want to copy it.
• When you want to report on live data.
• When your application operations require more than find()
6. When should I use it versus my RDBMS?
• That’s a very good question.
8. Let us take a scenario
• You have a set of data
• You want to Report on it and Analyze it
• This data isn’t live – so we don’t need to worry about that.
• There may be a lot of it.
10. Data Details
• Every medical practice in England
• 10+ years available month by month
• Quantity and cost of each item prescribed and number of scripts.
• 100+ million rows a year
13. The Hardware
• CentOS 7 – on Amazon EC2
• 32GB RAM
• 4 CPU Cores
• Databases on 2000 IOPS 400GB Disk
• Temp files on 1200 IOPS 400GB Disk
14. In the Blue corner
• MySQL Version 8.0
• Out of the box defaults
• Cache (innodb_buffer_pool) set to 80% of RAM
• 3 Tables (13GB)
• Indexing as required
15. In the Green Corner
• MongoDB 4.0.3
• Cache set to default (50% of RAM minus 1GB)
• 1 Collection
• 15 GB of BSON
• 5GB on disk due to Snappy.
17. How much did the UK Spend in 2017?
select sum(cost) from prescriptions;

{ $group: {
    _id: "all",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }
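The nested $sum idiom above is worth unpacking: the inner $sum is an expression that adds up the cost array inside each document, while the outer $sum is the accumulator that totals those per-document sums across the group. A plain JavaScript sketch of the same two-level reduction (field names follow the talk's schema; the sample data is invented):

```javascript
// Each document holds an array of prescription line items, as in the
// talk's document model.
const docs = [
  { prescriptions: [{ cost: 10.5 }, { cost: 4.5 }] },
  { prescriptions: [{ cost: 20.0 }] },
];

// Inner $sum: total the array within a single document.
const docTotal = (doc) =>
  doc.prescriptions.reduce((acc, p) => acc + p.cost, 0);

// Outer $sum: accumulate the per-document totals across the group.
const groupTotal = (ds) => ds.reduce((acc, d) => acc + docTotal(d), 0);

console.log(groupTotal(docs)); // 35
```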
20. The other Result
select sum(cost) from prescriptions;

{ $group: {
    _id: "all",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 37 seconds (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 54 seconds (25% CPU, 0 IOPS, 0 MB/s)
21. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Depth Matters
• More flexibility
1 Bob 3.5 18-7-1972 NULL
2 Sally 8.9 15-3-1984 “Magic”
22. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
_id:int 1 name: str(3) “bob” size:double 3.5 when: date 18-7-1972
23. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Hierarchy Matters
• Much more flexibility
_id:int 1 name:str(3) "bob" sizes:array(256) [
  double 3.5,
  double 10,
  double 1.2,
  double 99 ]
when:date 18-7-1972
24. What about an Index?
select sum(cost) from prescriptions;  -- apply covering index here

{ $group: {
    _id: "all",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 21 seconds (vs. 37) | MongoDB: 54 seconds
25% CPU, 0 IOPS, 0 MB/s on both.
25. But can’t MongoDB index cover too?
• Yes, but not when it’s a multikey (array) index.
• Each unique value is indexed only once.
• So the index cannot recreate the array.
26. Can we fix that?
• What if we flatten the data?
• Lots of redundancy
• Collection is now 200% larger
• Normalisation?
db.prescriptions.aggregate([
  { $unwind: "$prescriptions" },
  { $project: { _id: 0 } },
  { $out: "tabular" }
])
db.tabular.createIndex({ "prescriptions.cost": 1 })
27. Flat, wide data.
• 110 M Rows
• 51 GB as BSON
• 15 GB Compressed
• Not really tabular.
• 860 MB Index
• Prefix Compression
• Super space efficient
28. Query Performance when flattened.
MongoDB, no covering index: 509 seconds (vs. 54) | MongoDB, with covering index: 509 seconds
6% CPU, 1700 IOPS, 30 MB/s in both cases.
29. Query Performance when flattened.
• That doesn’t look right.

No index: 509 seconds (vs. 54) | Index: 509 seconds
6% CPU, 1700 IOPS, 30 MB/s in both cases.

"queryPlanner" : {
  "winningPlan" : {
    "stage" : "COLLSCAN",
    "direction" : "forward"
} }
30. Flat, wide data.
• Need to persuade aggregation to use the index.
• Add a query (cost > 0), or sort by cost at the start.
• Still slower than the document model?
• The document model is efficient.
• This data is actually MOST of the database – 110M entries.
• Imagine if our index were a small percentage of the data.
• Index compression has a cost when reading.

No index: 509 seconds (vs. 54) – 6% CPU, 1700 IOPS, 30 MB/s
Index: 177 seconds (vs. 54) – 25% CPU, 0 IOPS, 0 MB/s
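One way to express the "add a query" nudge from the slide, sketched as a pipeline object (the cost > 0 predicate matches every real row, but makes the cost index eligible; collection and field names follow the flattened data above):

```javascript
// A leading $match on the indexed field lets the planner pick an index
// scan instead of a COLLSCAN; the predicate is deliberately one that
// every document satisfies.
const pipeline = [
  { $match: { "prescriptions.cost": { $gt: 0 } } },
  { $group: { _id: "all", t: { $sum: "$prescriptions.cost" } } },
];

// In the mongo shell this would run as:
//   db.tabular.aggregate(pipeline)
console.log(Object.keys(pipeline[0])[0]); // "$match"
```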
31. Table Layout
RDBMS
• Lots of fixed size rows in a file
• Nice predictable layout
MongoDB
• Variable Length rows in a file
• Less predictable layout
32. Table Layout – The Truth
• RDBMS and MongoDB both store records in Trees
• Records are, in some ways, just like indexes.
33. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Index on identity
• Can only walk the tree
• Slower to collection scan
• Less lock contention
34. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Organised by Identity (int64)
• No links between leaves
• Slower to scan everything
• Much less lock contention
35. In-Document rollup.
• We have multiple data items in each document.
• Add summaries of cost in each document?
• No extra cost when updating anyway: $max, $min, $sum, counts.
• The RDBMS equivalent has a big cost.
• You need to know in advance, or add as needed.
• Like an RDBMS index.
• What if we index the in-document rollup?
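A sketch of the rollup idea: precompute a per-document total alongside the array, and keep it current with the same update that appends a line item. The field names `costtotal` and `costmax` are illustrative, not from the talk:

```javascript
// Maintain rollup fields inside the document so a later $group can
// read one scalar per document instead of re-summing the array.
function addLineItem(doc, item) {
  doc.prescriptions.push(item);
  doc.costtotal = (doc.costtotal || 0) + item.cost;
  doc.costmax = Math.max(doc.costmax ?? -Infinity, item.cost);
  return doc;
}

// The equivalent MongoDB update touches only the rollup fields:
//   db.prescriptions.updateOne({ _id: id },
//     { $push: { prescriptions: item },
//       $inc:  { costtotal: item.cost },
//       $max:  { costmax: item.cost } })

const doc = { prescriptions: [], costtotal: 0 };
addLineItem(doc, { name: "Aspirin", cost: 2.5 });
addLineItem(doc, { name: "Ibuprofen", cost: 4.0 });
console.log(doc.costtotal); // 6.5
```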
36. MongoDB with in document roll-up.
No index on the in-document rollup: 18 seconds (vs. 54, or 21 in the RDBMS) | Index on the rollup: 18 seconds
25% CPU, 0 IOPS, 0 MB/s in both cases.
37. MongoDB with in document roll-up.
No index on the in-document rollup: 18 seconds (vs. 54, or 21 in the RDBMS) | Index on the rollup: 0.01 seconds
25% CPU, 0 IOPS, 0 MB/s in both cases.
38. So far…
• When data fits in RAM:
• RDBMS table scan is faster than a Mongo collection scan.
• RDBMS index scan is faster than an RDBMS table scan.
• A large MongoDB index scan isn’t the solution.
• In-document rollups beat the RDBMS index scan.
• An index scan of in-document rollups is really quick.
40. What if it wasn’t all about the CPU?
• Data Lakes and “Big Data”
• Limited by reading data from disk
• Limited by Parallelism
• New Experiment Time.
• Reduce RAM to much less than Data Size*
• Work with disk bound data.
• Still one CPU.
41. Table/Collection scan from Disk
select sum(cost) from prescriptions;

{ $group: {
    _id: "all",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 157 seconds (3.5% CPU, 1253 IOPS, 103 MB/s)
MongoDB: 61 seconds (25% CPU, 650 IOPS, 103 MB/s)
42. Why is MongoDB faster?
MySQL
• Data Size = 15 GB
MongoDB
• Data Size = 5GB*
• Minimal decompression overhead
*Not ‘Big’ Data
43. Index scan from Disk
select sum(cost) from prescriptions;  -- add an index

{ $group: {
    _id: "all",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 41 seconds (25% CPU, 1020 IOPS, 103 MB/s)
MongoDB: 61 seconds (25% CPU, 650 IOPS, 103 MB/s)
45. More Complex Queries
From RAM
• May still use disk for temp tables, storage, etc.
• All tables and indexes fit in RAM.
From disk
• Data does NOT fit in RAM.
• Some indexes MAY be in RAM.
• No indexes used for MongoDB.
46. With Group BY (RAM)
select sum(cost)
from prescriptions
group by period;
{ $group: {
    _id: "$period",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 63 seconds | MongoDB: 60 seconds
25% CPU, 0 IOPS, 0 MB/s on both.
MySQL with index: 30 seconds (25% CPU, 0 IOPS, 0 MB/s)
47. With Group BY (Disk)
select sum(cost)
from prescriptions
group by period;
{ $group: {
    _id: "$period",
    t: { $sum: { $sum: "$prescriptions.cost" } }
} }

MySQL: 160 seconds (10% CPU, 1020 IOPS, 103 MB/s)
MySQL with index: 39 seconds (18% CPU, 1010 IOPS, 103 MB/s)
MongoDB: 63 seconds (24% CPU, 650 IOPS, 82 MB/s)
48. Top 10 practices by spend (RAM)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
  { $group: { _id: "$practice",
      spend: { $sum: { $sum: "$prescriptions.cost" } } } },
  { $sort: { spend: -1 } },
  { $limit: 10 }
]

MySQL: 75 seconds (10% CPU, 1020 IOPS, 103 MB/s)
MySQL with index: 51 seconds (18% CPU, 1010 IOPS, 103 MB/s)
MongoDB: 63 seconds (24% CPU, 650 IOPS, 82 MB/s)
49. Top 10 practices by spend (Disk)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
  { $group: { _id: "$practice",
      spend: { $sum: { $sum: "$prescriptions.cost" } } } },
  { $sort: { spend: -1 } },
  { $limit: 10 }
]

MySQL: 160 seconds (10% CPU, 1150 IOPS, 104 MB/s)
MySQL with index: 55 seconds (21% CPU, 724 IOPS, 77 MB/s)
MongoDB: 64 seconds (25% CPU, 650 IOPS, 82 MB/s)
50. £ per patient – JOIN and Group (RAM)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id”:"$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum”:"$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient":-1},
{ "$limit": 10}
105 Seconds 62 Seconds
25% CPU 0 IOPS 0 MB/s 25% CPU 0 IOPS 0 MB/s
51. £ per patient – JOIN and Group (Disk)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id”:"$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum”:"$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient":-1},
{ "$limit": 10}
160 Seconds 62 Seconds
17% CPU 1200 IOPS 103 MB/s 24% CPU 650 IOPS 82 MB/s
52. £/patient/county – nested SELECT (RAM)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
  { "$group": { "_id": { "county": "$address.county",
        "practice": "$practice" },
      "spend": { "$sum": { "$sum": "$prescriptions.cost" } },
      "patients": { "$avg": "$numpatients" } } },
  { "$group": { "_id": "$_id.county",
      "spend": { "$sum": "$spend" },
      "patients": { "$sum": "$patients" } } },
  { "$addFields": { "costperpatient":
      { "$divide": [ "$spend", "$patients" ] } } },
  { "$match": { "patients": { "$gt": 100000 } } },
  { "$sort": { "costperpatient": -1 } },
  { "$limit": 20 }
])

MySQL: 160 seconds (24% CPU, 0 IOPS) | MongoDB: 66 seconds (24% CPU, 0 IOPS, 0 MB/s)
53. Result
County | Spend (£M) | Patients | Per patient (£)
LINCOLNSHIRE | 122 | 699309 | 175
WIRRAL | 25 | 149554 | 172
CO DURHAM | 102 | 596638 | 171
CLEVELAND | 75 | 462593 | 163
ISLE OF WIGHT | 25 | 144555 | 162
54. Result
Highest spend per patient:
County | Spend (£M) | Patients | Per patient (£)
LINCOLNSHIRE | 122 | 699309 | 175
WIRRAL | 25 | 149554 | 172
CO DURHAM | 102 | 596638 | 171
CLEVELAND | 75 | 462593 | 163
ISLE OF WIGHT | 25 | 144555 | 162

Lowest spend per patient:
County | Spend (£M) | Patients | Per patient (£)
BERKSHIRE | 102 | 944538 | 108
MIDDLESEX | 150 | 1469189 | 102
BRISTOL | 14 | 145660 | 97
LONDON | 522 | 5672564 | 92
LEEDS | 9 | 122785 | 74
55. £/patient/county – nested SELECT (Disk)
(Same SQL and aggregation pipeline as slide 52, re-run with the data on disk.)
MySQL: 220 seconds (21% CPU, 700 IOPS, 68 MB/s)
MongoDB: 67 seconds (23% CPU, 640 IOPS, 82 MB/s)
56. Most common drugs – REGROUP (RAM)
select bnfcode, max(name), sum(nitems) as items
from nhs.prescriptions
group by bnfcode
order by items desc
limit 10;

db.prescriptions.aggregate([
  { "$unwind": "$prescriptions" },
  { "$group": {
      _id: "$prescriptions.bnfcode",
      name: { $max: "$prescriptions.name" },
      items: { $sum: "$prescriptions.nitems" } } },
  { "$sort": { "items": -1 } },
  { "$limit": 10 }
])

MySQL: 300 seconds (23% CPU, 0 IOPS, 0 MB/s)
MySQL with index: 126 seconds (25% CPU, 0 IOPS, 0 MB/s)
MongoDB: 262 seconds (25% CPU, 0 IOPS, 0 MB/s)
57. Grouping Techniques
SQL
• Can take advantage of index ordering by the group key: all items with the same key come together, so it can process one group at a time.
  1,1,1,1,1,2,2,2,2,3,3
• Uses a temp table and a sort when it can’t.
MongoDB
• Does not take advantage of ordering (yet) – maintains a data structure with all values.
• Assumes you will want to group further down the pipeline, so it is optimised for that.
• Builds a tree (using disk) for the values.
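The two strategies the slide contrasts can be sketched side by side: a streaming group that relies on input sorted by key (one accumulator live at a time, the RDBMS trick) versus a hash-map group that works on any order (closer to what the aggregation framework maintains). Both are illustrative, not either engine's actual code:

```javascript
// Streaming group: input must arrive sorted by key, so each key's rows
// are contiguous and only one running total is held at a time.
function groupSorted(rows) {
  const out = [];
  let cur = null;
  for (const { key, value } of rows) {
    if (!cur || cur.key !== key) {
      cur = { key, total: 0 };
      out.push(cur);
    }
    cur.total += value;
  }
  return out;
}

// Hash-map group: order-independent, but keeps every key's accumulator
// alive until the input is exhausted.
function groupHashed(rows) {
  const totals = new Map();
  for (const { key, value } of rows) {
    totals.set(key, (totals.get(key) || 0) + value);
  }
  return [...totals].map(([key, total]) => ({ key, total }));
}

const sorted = [
  { key: 1, value: 5 }, { key: 1, value: 7 },
  { key: 2, value: 1 }, { key: 3, value: 2 },
];
console.log(groupSorted(sorted)); // totals: 12, 1, 2
```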
59. Most common drugs – REGROUP (Disk)
select bnfcode, max(name), sum(nitems) as items
from nhs.prescriptions
group by bnfcode
order by items desc
limit 10;

db.prescriptions.aggregate([
  { "$unwind": "$prescriptions" },
  { "$group": {
      _id: "$prescriptions.bnfcode",
      name: { $max: "$prescriptions.name" },
      items: { $sum: "$prescriptions.nitems" } } },
  { "$sort": { "items": -1 } },
  { "$limit": 10 }
])

MySQL: 1427 seconds (4% CPU, 1800 IOPS, 100 MB/s)
MySQL with index: 192 seconds (13% CPU, 520 IOPS, 56 MB/s)
MongoDB: 262 seconds (24% CPU, 180 IOPS, 23 MB/s)
60. Most Expensive Drugs - Result
Drug | Category | Spend
Rivaroxaban_Tab 20mg | Anticoagulant | £100,007,025
Apixaban_Tab 5mg | Anticoagulant | £79,302,385
Fostair_Inh 100mcg/6mcg (120D) C | Asthma | £75,541,726
Tiotropium_Pdr For Inh Cap 18mcg | COPD | £66,348,167
Sitagliptin_Tab 100mg | Diabetes | £60,919,725
Symbicort_Turbohaler 200mcg/6mcg | Asthma | £44,314,887
Apixaban_Tab 2.5mg | Anticoagulant | £41,290,937
Ins Lantus SoloStar_100u/ml 3ml | Diabetes | £41,182,602
Ezetimibe_Tab 10mg | Cholesterol | £40,756,234
Linagliptin_Tab 5mg | Diabetes | £38,503,893
61. Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([{"$unwind":"$prescriptions"},{"$group":{"_id
":"$prescriptions.bnfcode","aveperperson":{"$avg":{"$divide":["$prescrip
tions.nitems","$numpatients"]}}}},{"$match":{"aveperperson":{"$ne":null}
}},{"$out":"typical"}])
db.prescriptions.aggregate([{"$match":{"numpatients":{"$gt":500},"actice
":{"$nin":["Y01924"]}}},{"$unwind":"$prescriptions"},{"$group":{"_id":{"
bnfcode":"$prescriptions.bnfcode","practice":"$practice"},"name":{"$max"
:"$prescriptions.name"},"pracicename":{"$max":"$address.name"},"nitems":
{"$sum":"$prescriptions.nitems"},"numpatients":{"$max":"$numpatients"},"
nmonths":{"$sum":1}}},{"$addFields":{"perpatient":{"$divide":[{"$divide"
:["$nitems","$nmonths"]},"$numpatients"]}}},{"$lookup":{"from":"typical"
,"localField":"_id.bnfcode","foreignField":"_id","as":"typical"}},{"$unw
ind":"$typical"},{"$addFields":{"ratio":{"$divide":["$perpatient","$typi
cal.aveperperson"]}}},{"$sort":{"ratio":-1}},{"$limit":10}])
62. Anomaly Detection – JOIN Derived (RAM)
(Same SQL and pipelines as the previous slide, now with timings.)
MySQL: 2250 seconds (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1489 seconds (25% CPU, 0 IOPS, 0 MB/s)
63. Results
Drug | Practice | Ratio | Note
Methadone | FULCRUM | 297 | Rehab
Trazodone | CARE HOMES MEDICAL | 242 | Elderly care
Buprenorphine | FULCRUM | 233 |
Thickenup PDR | CARE HOMES MEDICAL | 174 |
Vitrex Nitrile Gloves | REETH MEDICAL | 173 | Preference?
Ema Film Gloves | NEW SPrintwells1 | 168 |
Pro D3 Cap | PALFREY HEALTH CENTRE | 123 | Vitamin D?
Fultium D3 Cap | MOHANTY | 123 | Vitamin D
65. Anomaly Detection – JOIN Derived (Disk)
(Same SQL and pipelines as slide 61, re-run with the data on disk.)
MySQL: 7848 seconds (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1655 seconds (25% CPU, 170 IOPS, 24 MB/s)
66. Conclusions
• MongoDB is faster from disk when there are no indexes
• MongoDB is generally faster for more complex queries
• MongoDB fits the data-lake model nicely.
73. BI Connector
• Total spend by period.
• Sum one column, group by primary key.

Seconds | SQL Unindexed | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 160 | 39 | 63 |
RAM | 63 | 30 | 60 | 187
74. BI Connector
• Total spend by PRACTICE.
• Sum one column, group by part of the key.

Seconds | SQL Unindexed | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 160 | 55 | 62 |
RAM | 75 | 51 | 62 | 230
75. BI Connector
• Total spend by PATIENT.
• Sum one column, group by a joined field.

Seconds | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 160 | 62 |
RAM | 105 | 62 | 230
76. BI Connector
• Average spend per patient per county.
• Sum one column, group by a computed field.

Seconds | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 220 | 67 |
RAM | 160 | 66 | 220
77. BI Connector
• Most prescribed drugs.
• Sum one column, group by a non-PK field.

Seconds | SQL Unindexed | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 1427 | 192 | 262 |
RAM | 300 | 126 | 262 | 280
78. BI Connector
• Anomaly detection.
• Sum one column, group from a subquery joined to a table.

Seconds | SQL Indexed | MongoDB Aggregation | BI Connector
Disk | 7848 | 1655 |
RAM | 2250 | 1489 | D.N.F.!
79. BI Connector – Did Not Finish – why?
• Anomaly Detection.
• Did run – but was taking a long time.
• Uses the expressive $lookup for the join.

$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    // group total by drugname, practice
    // divide by practice size
    // group by drugname
    // match $$drug
  ],
  as: <output array field>
}
80. BI Connector – why Did Not Finish
• Anomaly Detection.
• Did run – but was taking a long time.
• Uses the expressive $lookup for the join.
• No index on the in-memory table.
• The hand-written version:
  • used a temp collection
  • made use of the fact that _id was the lookup field

$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    // group total by drugname, practice
    // divide by practice size
    // group by drugname
    // match $$drug
  ],
  as: <output array field>
}
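For readers who haven't met the expressive form, here is a minimal concrete sketch of a pipeline-style $lookup, built as a plain object so its structure can be inspected. It mirrors the hand-written approach above (joining the "typical" collection on its _id); the `code` variable name is an assumption standing in for the elided slide pseudocode:

```javascript
// Expressive $lookup: the sub-pipeline runs against the joined
// collection and can reference outer-document values via let / $$.
const lookupStage = {
  $lookup: {
    from: "typical",
    let: { code: "$prescriptions.bnfcode" },
    pipeline: [
      // Only $expr can compare a joined field to a $$ variable.
      { $match: { $expr: { $eq: ["$_id", "$$code"] } } },
    ],
    as: "typical",
  },
};

console.log(Object.keys(lookupStage.$lookup).sort().join(","));
// "as,from,let,pipeline"
```

Because the sub-pipeline's $expr match could not use an index on the in-memory table here, the plain localField/foreignField form against an _id-keyed temp collection was the faster hand-written route.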
81. Conclusions
• The BI Connector is a little slower than the RDBMS for simple queries.
• The more complicated the query, the better it does relatively.
• It’s not as quick as hand-crafted aggregation.
  • But you can put that in views.
• But it’s very convenient:
  • You can use your existing BI tooling.
  • And you could always use Charts instead.
84. Pros
• Simpler to write much more complicated processing.
• Lots of libraries of pre-written code.
• Able to perform a lot of in-memory computation.
• MongoDB can send them data very, very quickly.
85. Cons
• Costs of transferring from or inside clouds (Atlas, AWS).
• Network speed limitations.
• Additional hardware.
• Security considerations.

AWS, same region: 1 cent/GB
AWS, between regions: 9 cents/GB
AWS, out: 11 cents/GB
86. So do I use Spa^HR^doop! Or not?
• Yes – those tools are great for many things
• But always push computation DOWN to MongoDB if you can
• There is a balance
• Amount of effort to write as a Pipeline
• Reduced network costs in time and money
87. Simple Example
• Pearson’s rho.
• Degree of correlation between two numeric lists.
• Let’s compare latitude (north vs. south)
• and quantity of drug per person.
• Hypothesis: “For some drugs, more is prescribed as you travel up or down the UK.”
88. Step 1 - Geocoding
• We need to augment our records with Lat/Long
• Download a handy set of postcode centroids
• mongoimport --type csv --headerline -d nhs postcodes.csv
• Use $lookup and $out to attach to each record and make new collection.
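The attach step amounts to a left join on postcode. A plain JavaScript sketch of what $lookup produces before $out writes the new collection (collection and field names here are assumptions, not the talk's actual schema):

```javascript
// Simulate $lookup: for each practice record, attach the matching
// postcode centroid as an array field, as the real pipeline would.
const postcodes = [
  { postcode: "YO1 7HH", lat: 53.96, lon: -1.08 },
];
const practices = [
  { practice: "A1", postcode: "YO1 7HH" },
  { practice: "B2", postcode: "ZZ9 9ZZ" }, // no centroid known
];

const geocoded = practices.map((p) => ({
  ...p,
  geo: postcodes.filter((c) => c.postcode === p.postcode),
}));

// The mongo-shell equivalent would be roughly:
//   db.practices.aggregate([
//     { $lookup: { from: "postcodes", localField: "postcode",
//                  foreignField: "postcode", as: "geo" } },
//     { $out: "practices_geo" } ])
console.log(geocoded[0].geo.length, geocoded[1].geo.length); // 1 0
```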
90. Step 2 – Group by drug
• For each drug compute average quantity/10,000 patients per
surgery
• Group to one record per drug with an array of objects
{
  drug: "Aspirin",
  prescribed: [
    { where: [ -3.5, 55.2 ],
      per10k: 75.4 }
    …
  ]
}
91. Group by BNF code
unwind = { $unwind: "$prescriptions" }
regroup = { $group: {
  _id: "$prescriptions.name",
  p: { $push: { where: [ "$lon", "$lat" ],
      per10k: { $multiply: [ 10000,
        { $divide: [ "$prescriptions.nitems",
                     "$numpatients" ] } ] } } } } }
92. Step 3 – Pearson’s rho
• Compute rho on the array, comparing per10k to latitude.
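Pearson's rho itself is a short computation. A sketch over the per-drug array shaped as on the previous slide, taking latitude as the second coordinate of `where` (the sample numbers are invented):

```javascript
// Pearson correlation coefficient between two equal-length lists.
function pearson(xs, ys) {
  const n = xs.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx2 = 0, dy2 = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    num += dx * dy;
    dx2 += dx * dx;
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}

// One drug's grouped record, shaped as on the previous slide.
const drug = {
  drug: "Aspirin",
  prescribed: [
    { where: [-3.5, 50.2], per10k: 40.0 },
    { where: [-1.2, 53.0], per10k: 55.1 },
    { where: [-2.0, 57.5], per10k: 71.9 },
  ],
};

const lats = drug.prescribed.map((p) => p.where[1]);
const per10k = drug.prescribed.map((p) => p.per10k);
console.log(pearson(lats, per10k).toFixed(2)); // "0.99"
```

A rho near +1 or −1 for a drug would support the north–south hypothesis; values near 0 would not.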
100. Conclusions
• The Aggregation Framework is fast.
• There is no truth to “an RDBMS is just better”.
• It’s a good choice for non-trivial, ad-hoc queries.
• It’s a good choice for large data sets.
• Consider sharding and microsharding.
• In a cloud world, push work to the database.
• Even with R, SAS, Spark, etc.
Editor's Notes
Intro – Shard N – Tech heavy – understandable by all
Not my usual talk – reviews: 40% too easy, 40% too hard – this is for the 20% in the middle.
Clarify – what if the Live data is in MongoDB and you want to report
In place? Copy to RDBMS?
Imagine – all things being equal – you want to build a data warehouse, or a data lake, or you don’t really know, you just want to report on the data you have.
How does MongoDB compare to a more traditional approach?
Although you do have to differentiate between OLAP and warehousing, etc.