Efficient Django
QuerySet Use
Jayson Falkner, PhD
jayson@counsyl.com
@jaysonfalkner
Overview
• 30 min talk
• How to go from Django QuerySet to SQL
• A straightforward way to profile
• Fast CRUD (Create, Read, Update, Delete)
Assumptions
• You are using Python + Django
• Avoiding direct SQL use
• Postgres (aka PostgreSQL)
• Performance matters
• Slides and code are on GitHub
github.com/jfalkner/Efficient-Django-QuerySet-Use
A simple data model
class Sample(models.Model):
    barcode = models.CharField(max_length=10, unique=True)
    production = models.BooleanField()
    created = models.DateTimeField()

    def status(self):
        return self.statuses.all()[0]

class SampleStatus(models.Model):
    RECEIVED = 1; LAB = 2; COMPLETE = 3
    sample = models.ForeignKey(Sample,
                               related_name='statuses')
    status_code = models.PositiveSmallIntegerField()
    created = models.DateTimeField()
Don’t do this
# What samples are in the lab?
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
# Loop through. It's business time!
for s in samples:
    do_something(s.barcode,
                 s.status().created)
What takes time?
[Diagram: the round trip for each query. Python (Django + your code): serialize SQL → socket → Postgres (the RDBMS): parse SQL, perform SQL, serialize results → socket → Python: parse results.]
Poor Man’s Django/QuerySet Profiling
from datetime import datetime
from django.db import connection as con
from example.models import Sample, SampleStatus

start = datetime.now()
query_count = len(con.queries)
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
for s in samples:
    do_something(s.barcode,
                 s.status().created)
print "Time: %s" % (datetime.now() - start)
print "Queries: %s" % (len(con.queries) - query_count)
for query in con.queries[query_count:]:
    print query

(Note: connection.queries is only populated when settings.DEBUG is True.)
The same profiling, wrapped in reusable helpers from the talk's repo:

from example.models import Sample, SampleStatus
import django_pg_utils

django_pg_utils.track_sql()
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
for s in samples:
    do_something(s.barcode,
                 s.status().created)
django_pg_utils.print_sql()

github.com/jfalkner/Efficient-Django-QuerySet-Use
# Read all matching samples.
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
# Loop with a query per `created` value.
for s in samples:
    do_something(s.barcode, s.status().created)

SQL Read in QuerySet (1,000 samples)
Python: 1.112s, Postgres: 0.009s, Queries: 1,001

SELECT "id", "barcode", "production", "created"
FROM "db_sample"
WHERE "db_sample"."production" = TRUE

SELECT "id", "sample_id", "status_code", "created"
FROM "db_samplestatus"
WHERE "sample_id" = 10
ORDER BY "created" DESC LIMIT 1
(And one more for each other sample...)
Why is it slow?
• O(n) queries when it could be O(1)
  • One query for all samples
  • One more query per sample for its status
• Serializes/deserializes unused values (see the aside below)
  • Sample: id, created
  • SampleStatus: id, status_code, sample_id
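As an aside (not from the slides, but standard Django): only() trims the unused-column cost, though it does nothing about the O(n) queries, which the deck fixes next:

# Fetch just `barcode`; other Sample fields are deferred and hit the
# database only if actually accessed later.
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB).only('barcode')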
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
# Cache all related statuses up front with one extra query.
samples = samples.prefetch_related('statuses')
for s in samples:
    do_something(s.barcode, s.status().created)
SQL Read in QuerySet (1,000 samples)
Python: 0.460s, Postgres: 0.018s, Queries: 2
SELECT "id", "barcode", "production", "created"
FROM "db_sample"
INNER JOIN "db_samplestatus" ON
("id" = "db_samplestatus"."sample_id")
WHERE ("production" = true AND
"db_samplestatus"."status_code" = 2 )
SELECT "id", "sample_id", "status_code", "created"
FROM "db_samplestatus"
WHERE "sample_id" IN (1220001, 1220002, 1220003 ...
Improving even more
• O(1) query = Good
• Python dict for lookups is relatively slow
• Let Postgres do the work
• Return only the fields of interest
• Minimizes serialize/deserialize time
• Smaller memory footprint
(Sample.objects
    # JOIN and compute the max status code from SampleStatus.
    .annotate(latest_status_code=Max(
        'statuses__status_code'))
    .filter(production=True,
            latest_status_code=SampleStatus.LAB)
    # Return exactly the fields to use.
    .values_list('barcode', 'statuses__created'))
SQL Read in QuerySet (10,000 samples)
Python: 0.23s, Postgres: 0.18s, Queries: 1
SELECT "barcode", "created"
FROM "db_sample"
LEFT OUTER JOIN "db_samplestatus" ON
("id" = "db_samplestatus"."sample_id")
WHERE ("production" = TRUE)
GROUP BY "id", "barcode", "production", "created",
         "db_samplestatus"."status_code"
HAVING MAX("db_samplestatus"."status_code") = 2
class Sample(models.Model):
    barcode = models.CharField(max_length=10, unique=True)
    production = models.BooleanField()
    created = models.DateTimeField()
    latest_status = models.ForeignKey(
        'SampleStatus', related_name='+', null=True)

    def status(self):
        return self.statuses.all()[0]

class SampleStatus(models.Model):
    ...
Using select_related()
JOIN via denormalized foreign key on Sample
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
samples = samples.select_related('latest_status')
for s in samples:
    do_something(s.barcode, s.latest_status.created)
SQL Read in QuerySet (10,000 samples)
Python: 0.71s, Postgres: 0.13s, Queries: 1
SELECT "id", "barcode", "production", "created",
"latest_status_id",
T3."id", T3."sample_id", T3."status",
T3."status_code", T3."created"
FROM "example_sample"
INNER JOIN "example_samplestatus" ON
("id" = "example_samplestatus"."sample_id")
LEFT OUTER JOIN "example_samplestatus" T3 ON
("latest_status_id" = T3."id")
WHERE ("production" = TRUE
AND "example_samplestatus"."status_code" = 2)
class Sample(models.Model):
    barcode = models.CharField(max_length=10, unique=True)
    production = models.BooleanField()
    created = models.DateTimeField()
    status_code = models.PositiveSmallIntegerField()
    status_changed = models.DateTimeField()

    def status(self):
        return self.statuses.all()[0]

class SampleStatus(models.Model):
    ...
Denormalize for speed
Change the data model to avoid a JOIN
# Read all at once and return a list of tuples.
(Sample.objects
    .filter(production=True,
            status_code=SampleStatus.LAB)
    .values_list('barcode', 'status_changed'))
SQL Read in QuerySet (1,000,000 samples)
Python: 0.18s, Postgres: 0.17s, Queries: 1
SELECT "db_sample"."barcode",
"db_sample"."status_code"
FROM "db_sample"
WHERE ("db_sample"."status_code" = 2
AND "db_sample"."production" = TRUE)
Why is it faster?
• JOINs and table scans are relatively slow
• Postgres's EXPLAIN helps show this
postgresql.org/docs/9.1/static/using-explain.html
EXPLAIN SELECT "db_sample"."barcode",
               "db_sample"."status_changed"
FROM "db_sample"
WHERE ("db_sample"."status_code" = 2
       AND "db_sample"."production" = TRUE)

Seq Scan on db_sample (cost=0.00..27203.00 rows=5567 width=8)
  Filter: (production AND (status_code = 2))
SQL Read in QuerySet (1,000,000 samples)

Sample, SampleStatus JOIN = 11s
GroupAggregate (cost=691622.75..787997.34 rows=2965372 width=23)
  Filter: (max(db_samplestatus.status_code) = 2)
  -> Sort (cost=691622.75..699036.18 rows=2965372 width=23)
     Sort Key: id, barcode, production, created, status_code, ...
     -> Hash Right Join (cost=43063.00..190400.49 rows=2965372 width=23)
        Hash Cond: (db_samplestatus.sample_id = db_sample.id)
        -> Seq Scan on db_samplestatus (cost=0.00..51015.90 rows=2993590 width=6)
        -> Hash (cost=24703.00..24703.00 rows=1000000 width=21)
           -> Seq Scan on db_sample (cost=0.00..24703.00 rows=1000000 width=21)
              Filter: production

Denormalized = 0.175s
Seq Scan on db_sample (cost=0.00..27203.00 rows=5567 width=8)
  Filter: (production AND (status_code = 2))

Denormalized + Multicolumn INDEX = 0.004s
Index Scan using db_sample_prod_lab on db_sample (cost=0.00..9101.15 rows=5567 width=8)
  Index Cond: ((status_code = 2) AND (production = true))
  Filter: production
Multi-Column Indexing

class Sample(models.Model):
    barcode = models.CharField(max_length=10, unique=True)
    production = models.BooleanField()
    created = models.DateTimeField()
    status_code = models.PositiveSmallIntegerField(db_index=True)
    status_changed = models.DateTimeField()

A per-field db_index on the Django model won't help here; the multi-column
index needs a Postgres-specific CREATE INDEX, issued via a helper method:

from django_pg_utils import pg_multicolumn_index
pg_multicolumn_index(Sample, ['production', 'status_code'])
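The real helper lives in the talk's repo; as a rough illustration of what it does (an assumption, not the repo's actual code), it boils down to issuing raw DDL:

from django.db import connection

def pg_multicolumn_index(model, field_names):
    # Build a CREATE INDEX over the model's table and the given columns.
    table = model._meta.db_table
    columns = ', '.join(
        '"%s"' % model._meta.get_field(name).column for name in field_names)
    index_name = '%s_%s_idx' % (table, '_'.join(field_names))
    cursor = connection.cursor()
    cursor.execute('CREATE INDEX "%s" ON "%s" (%s)'
                   % (index_name, table, columns))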
How long does it take?

QuerySet SELECT            # Samples     Time (Postgres)
Loop                       1,000         1.11s (0.01s)
Loop prefetch_related()    1,000         0.46s (0.02s)
Loop select_related()      10,000        0.71s (0.13s)
values_list() + JOIN       10,000        0.23s (0.19s)
Denorm + INDEX             100,000,000   0.99s (0.38s)

On a MacBook Pro Retina. Postgres 9.1. No config tweaks.
In context at Counsyl
• How much is 100,000,000 samples?
  • e.g. the query: which samples need processing this week?
• Consider Counsyl's main product
  • ~4,000,000 US pregnancies per year
  • Screen 2 people per pregnancy (mother and father)
  • 8,000,000 samples/year, so that is about 12 years of samples!
Oh yeah, other stuff!
• Batching CRUD actions = speed
• Avoid any O(n) loops over model objects
• Bulk CREATE, UPDATE, DELETE exist
  • bulk_create(), update(), delete()
  • save() won't batch: it always issues queries per object
• Helper code for multi-value UPDATE
docs.djangoproject.com/en/dev/topics/db/queries/
SELECT "id", "barcode", "production", "created"
FROM "db_sample"
WHERE "production" = TRUE
SELECT (1) AS "a"
FROM "db_sample"
WHERE "id" = 1 LIMIT 1
samples = Sample.objects.filter(production=True)
# Loop, update, and invoke `save()` for each.
for s in samples:
s.barcode = 'PREFIX'+s.barcode
s.save()
Inefficient UPDATE in QuerySet
Queries: O(n)
UPDATE "db_sample"
SET "barcode" = 'PREFIX0',
"production" = TRUE, "created" = '2013-09-29'
WHERE "db_sample"."id" = 1
(And one more for each other sample...)
Efficient UPDATE in QuerySet
Queries: O(1)

# Update all matching rows with one UPDATE statement.
now = datetime.now()
(Sample.objects.filter(production=True)
    .update(created=now))

UPDATE "db_sample"
SET "created" = '2013-09-29 03:40:18.925695-05:00'
WHERE "db_sample"."production" = TRUE
No multi-value update?
• Django's update() sets the same value for every matched row
• SQL itself has no such restriction:

UPDATE db_sample
SET barcode = input.update
FROM (SELECT unnest(ARRAY[1, 2, 3, 4, 5, 6]),
             unnest(ARRAY[6, 6, 6, 6, 6, 6])) AS input (filter, update)
WHERE id = input.filter;
Efficient UPDATE in QuerySet
Queries: O(1)

values = Sample.objects.values_list('id', 'barcode')
# Build parallel lists: ids to match and new barcodes to assign.
filter_vals = [pk for pk, _ in values]
update_vals = ['PREFIX' + bar for _, bar in values]
django_pg_utils.pg_bulk_update(Sample, 'id', 'barcode',
                               filter_vals, update_vals)
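pg_bulk_update() also lives in the talk's repo; a minimal sketch of the idea (assumed, not the repo's actual code), relying on psycopg2 adapting Python lists to Postgres arrays:

from django.db import connection

def pg_bulk_update(model, filter_field, update_field,
                   filter_vals, update_vals):
    # One round trip: unnest two parallel arrays into (filter, update)
    # rows, then join them against the table by the filter column.
    table = model._meta.db_table
    filter_col = model._meta.get_field(filter_field).column
    update_col = model._meta.get_field(update_field).column
    sql = ('UPDATE "%s" SET "%s" = input.update_val '
           'FROM (SELECT unnest(%%s) AS filter_val, '
           'unnest(%%s) AS update_val) AS input '
           'WHERE "%s" = input.filter_val') % (table, update_col, filter_col)
    cursor = connection.cursor()
    cursor.execute(sql, [list(filter_vals), list(update_vals)])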
Inefficient CREATE in QuerySet
Queries: O(n)

# Loop and make samples one at a time.
for barcode in barcodes:
    Sample.objects.create(barcode=barcode,
                          production=True,
                          created=datetime.now())

INSERT INTO "example_sample" ("barcode",
                              "production",
                              "created")
VALUES ('01234', TRUE, '2013-10-04 01:55:15.456163-05:00')
RETURNING "example_sample"."id"
(And one more for each other sample...)
Efficient CREATE in QuerySet
Queries: O(1)

# Buffer all Sample instances with values in a list.
samples = []
for barcode in barcodes:
    samples.append(Sample(barcode=barcode,
                          production=True,
                          created=datetime.now()))
# Bulk create all samples at once.
Sample.objects.bulk_create(samples)

INSERT INTO "example_sample" ("barcode",
                              "production",
                              "created")
VALUES ('100000', TRUE, '2013-10-04 02:00:01.556393-05:00'),
       ('100001', TRUE, '2013-10-04 02:00:01.556483-05:00'),
       ('100002', TRUE, '2013-10-04 02:00:01.556526-05:00'),
       ...
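For very large lists, bulk_create() also accepts a batch_size argument to chunk the work into several INSERT statements instead of one enormous one.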
Summary
• Batch/Bulk Everything FTW
• READ: values() and values_list()
  • pg_multicolumn_index()
• CREATE: bulk_create()
• UPDATE: update() and pg_bulk_update()
• DELETE: delete()
github.com/jfalkner/Efficient-Django-QuerySet-Use
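DELETE never gets its own slide; for completeness, the queryset-level call is also one bulk statement (a minimal sketch):

# One DELETE for all matching rows (plus any cascade handling),
# instead of n per-object delete() calls.
Sample.objects.filter(production=False).delete()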
Efficient Django
QuerySet Use
Jayson Falkner, PhD
jayson@counsyl.com
@jaysonfalkner
github.com/jfalkner/Efficient-Django-QuerySet-Use
