AVOIDING BAD DATABASE
SURPRISES
Simulation and Scalability
SOME HORROR STORIES
WEB APP DOESN’T SCALE
You’ve got a brilliant app
You’ve got a brilliant cloud deployment
EC2, ALB — all the right moving parts
It still doesn't scale
SLOW
ANALYTICS DON’T SCALE
You’ve studied the data
You’ve got a model that’s hugely important
It explains things
It predicts things
But. It’s SLOW
YOU BLAMED PYTHON
Web:
NGINX,
uWSGI,
the proxy server,
the coffee shop
Analytics:
Pandas,
Scikit Learn,
Jupyter Notebook,
open office layouts
And More…
STACK OVERFLOW SAYS “PROFILE”
So you profiled
and you profiled
And…
It turns out it’s the database
HORROR MOVIE TROPE
There’s a monster
And it’s in your base
And it’s killing your dudes
It’s Your Database
KILLING THE MONSTER
Hard work
Lots of stress
Many Techniques
Indexing
Denormalization
I/O Concurrency (i.e., more devices)
Compression
CAN WE PREVENT ALL THIS?
Spoiler Alert: Yes
TO AVOID BECOMING A HORROR STORY
Simulate
Early
Often
A/K/A SYNTHETIC DATA
Why You Don’t Necessarily Need Data for Data Science
https://medium.com/capital-one-tech
HORROR MOVIE TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
DATA IS THE ONLY THING
THAT MATTERS
Foundational Concept
SIDE-BAR ARGUMENT
UX is important, but secondary
But it’s never the kind of bottleneck app and DB servers are
You will also experiment with UX
I’m not saying ignore the UX
Data lasts forever. Data is converted and preserved.
UX is transient. Next release will have a better, more modern experience.
SIMULATE EARLY
Build your data models first
Build the nucleus of your application processing
Build a performance testing environment
With realistic volumes of data
SIMULATE OFTEN
When your data model changes
When you add features
When you acquire more sources of data
Rerun the nucleus of your application processing
With realistic volumes of data
PYTHON MAKES THIS EASY
WHAT YOU’LL NEED
A candidate data model
The nucleus of processing
RESTful API CRUD elements
Analytical Extract-Transform-Load-Map-Reduce steps
A data generator to populate the database
Measurements using time.perf_counter()
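A minimal sketch of the measurement step, assuming an in-memory SQLite database as a stand-in for the real one. The `timed` helper, the `card` table, and the row counts are illustrative names and values, not part of the original talk.

```python
import sqlite3
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record elapsed wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
conn = sqlite3.connect(":memory:")  # assumption: stand-in for the real database
conn.execute("CREATE TABLE card (suit TEXT, rank INTEGER)")

# 50,000 deterministic rows: suits cycle H, S, D, C and ranks cycle 1..13
rows = [("HSDC"[i % 4], i % 13 + 1) for i in range(50_000)]

with timed("bulk insert", results):
    conn.executemany("INSERT INTO card VALUES (?, ?)", rows)
    conn.commit()

with timed("count by suit", results):
    counts = dict(conn.execute("SELECT suit, COUNT(*) FROM card GROUP BY suit"))

for label, seconds in results.items():
    print(f"{label}: {seconds:.4f} s")
```

Collecting the timings in a dictionary makes it easy to compare runs after each data model change.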
DATA MODEL
SQLAlchemy or Django ORM (or others, SQLObject, etc.)
Data Classes
Plain Old Python Objects (POPO) and JSON serialization
If we use JSON Schema validation, we can do cool stuff
THE PROBLEM: FLEXIBILITY
SQL provides minimal type information (string, decimal, date, etc.)
No ranges, enumerated values or other domain details (e.g., name vs. address)
Does provide Primary Key and Foreign Key information
Data classes provide more detailed type information
Still doesn’t include ranges or other domain details
No PK/FK help at all
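A small illustration of the gap described above, using a hypothetical `PlainCard` data class. The type hints document intent, but nothing checks the values at construction time.

```python
from dataclasses import dataclass


@dataclass
class PlainCard:
    suit: str  # hint only; nothing restricts this to "H", "S", "D", "C"
    rank: int  # hint only; nothing restricts this to 1..13


# Out-of-domain values are accepted without complaint.
bad = PlainCard(suit="hearts", rank=-12)
print(bad)
```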
A SOLUTION
Narrow type specifications using JSON Schema
Examples to follow
class Card(Model):
    """
    title: Card
    description: "Simple Playing Cards"
    type: object
    properties:
      suit:
        type: string
        enum: ["H", "S", "D", "C"]
      rank:
        type: integer
        minimum: 1
        maximum: 13
    """
JSON Schema Definition
In YAML Notation
HOW DOES THIS WORK?
A metaclass parses the schema YAML and builds a validator
An abstract superclass provides __init__() to validate the document
import yaml
import json
import jsonschema

class SchemaMeta(type):
    def __new__(mcs, name, bases, namespace):
        # pylint: disable=protected-access
        result = type.__new__(mcs, name, bases, dict(namespace))
        # safe_load: yaml.load() without a Loader argument fails on PyYAML 6+
        result.SCHEMA = yaml.safe_load(result.__doc__)
        jsonschema.Draft4Validator.check_schema(result.SCHEMA)
        result._validator = jsonschema.Draft4Validator(result.SCHEMA)
        return result
Builds JSONSchema validator
from __doc__ string
class Model(dict, metaclass=SchemaMeta):
    """
    title: Model
    description: abstract superclass for Model
    """
    @classmethod
    def from_json(cls, document):
        # The YAML parser also reads the JSON text used here
        return cls(yaml.safe_load(document))

    @property
    def json(self):
        return json.dumps(self)

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        if not self._validator.is_valid(self):
            raise TypeError(list(self._validator.iter_errors(self)))
Validates object and raises TypeError
>>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
>>> h1['suit']
'H'
>>> h1.json
'{"suit": "H", "rank": 1}'
Deserialize POPO from JSON text
Serialize POPO into JSON text
>>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in from_json
File "<stdin>", line 15, in __init__
TypeError: [<ValidationError: "'hearts' is not one of
['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less
than the minimum of 1'>]
Fail to deserialize invalid POPO from JSON text
WHY?
JSON Schema allows us to provide
Type (string, number, integer, boolean, array, or object)
Ranges for numerics
Enumerated values (for numbers or strings)
Format for strings (e.g., email, uri, date-time)
Text Patterns for strings (more general regular expression handling)
DATABASE SIMULATION
AHA
With JSON schema we can build simulated data
THERE ARE SIX SCHEMA TYPES
null — Always None
integer — Use const, enum, minimum, maximum constraints
number — Use const, enum, minimum, maximum constraints
string — Use const, enum, format, or pattern constraints
There are 17 defined formats to narrow the constraints
array — recursively expand items to build an array
object — recursively expand properties to build a document
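The numeric cases in the list above could be sketched as follows. This is one possible reading, written as plain functions rather than methods; the `_bounds` helper and the default span of 100 used when a bound is missing are assumptions, and integer schemas are assumed to carry integer bounds.

```python
import random


def _bounds(schema, span=100):
    """Return (lo, hi) from minimum/maximum, inventing a missing bound."""
    if "minimum" in schema:
        lo = schema["minimum"]
        hi = schema.get("maximum", lo + span)  # assumed default
    else:
        hi = schema.get("maximum", span)       # assumed default
        lo = hi - span
    return lo, hi


def gen_integer(schema):
    """Pick an integer that satisfies const, enum, or minimum/maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.randint(lo, hi)  # both ends inclusive


def gen_number(schema):
    """Pick a float that satisfies const, enum, or minimum/maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.uniform(lo, hi)


rank_schema = {"type": "integer", "minimum": 1, "maximum": 13}
print([gen_integer(rank_schema) for _ in range(5)])
```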
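import random

class Generator:
    def __init__(self, parent_schema, domains=None):
        self.schema = parent_schema

    def gen_null(self, schema):
        return None

    # Bodies of the next three methods are elided on the slide
    def gen_string(self, schema): ...
    def gen_integer(self, schema): ...
    def gen_number(self, schema): ...

    def gen_array(self, schema):
        # lo and hi were undefined on the slide; reading them from
        # minItems/maxItems (with arbitrary defaults) is an assumption
        lo = schema.get('minItems', 0)
        hi = schema.get('maxItems', lo + 5)
        size = random.randint(lo, hi)
        doc = [self.generate(schema.get('items')) for _ in range(size)]
        return doc

    def gen_object(self, schema):
        doc = {
            name: self.generate(subschema)
            for name, subschema in schema.get('properties', {}).items()
        }
        return doc

    def generate(self, schema=None):
        schema = schema or self.schema
        schema_type = schema.get('type', 'object')
        # Dispatch on the schema type name: 'string' -> gen_string, etc.
        method = getattr(self, f"gen_{schema_type}")
        return method(schema)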
Finds gen_* methods
def make_documents(model_class, count=100, domains=None):
    generator = Generator(model_class.SCHEMA, domains)
    docs_iter = (generator.generate() for i in range(count))
    for doc in docs_iter:
        print(model_class(**doc))
Or write to a file
Or load a database
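A sketch of the two alternatives to printing, assuming newline-delimited JSON as the file format and SQLite as the database. The `fake_cards` function is a stand-in for the schema-driven generator so the example runs on its own; all function and table names are illustrative.

```python
import json
import random
import sqlite3
import tempfile
from pathlib import Path


def fake_cards(count):
    """Stand-in for the schema-driven generator: yields simple card documents."""
    for _ in range(count):
        yield {"suit": random.choice("HSDC"), "rank": random.randint(1, 13)}


def write_json_lines(docs, path):
    """Write one JSON document per line; return the number written."""
    written = 0
    with open(path, "w", encoding="utf-8") as target:
        for doc in docs:
            target.write(json.dumps(doc) + "\n")
            written += 1
    return written


def load_database(path, conn):
    """Bulk-load the JSON-lines file into a card table."""
    conn.execute("CREATE TABLE IF NOT EXISTS card (suit TEXT, rank INTEGER)")
    with open(path, encoding="utf-8") as source:
        rows = ((doc["suit"], doc["rank"]) for doc in map(json.loads, source))
        conn.executemany("INSERT INTO card (suit, rank) VALUES (?, ?)", rows)
    conn.commit()


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "cards.jsonl"
    written = write_json_lines(fake_cards(1_000), path)
    conn = sqlite3.connect(":memory:")
    load_database(path, conn)
    loaded = conn.execute("SELECT COUNT(*) FROM card").fetchone()[0]

print(written, loaded)
```

Keeping the file as an intermediate step means the same synthetic data set can be reloaded after each schema change.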
NOW YOU CAN SIMULATE
Early Often
WHAT ABOUT?
More sophisticated data domains?
Name, Address, Comments, etc.
More than text. No simple format.
Primary Key and Foreign Key Relationships
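For the key relationships, one possible approach (not shown in the slides) is to generate parent rows first and then draw each child's foreign key from the parent keys that already exist. The `customer` and `order` shapes and all field names below are hypothetical.

```python
import random


def make_parents(n, start=1):
    """Parents get sequential surrogate primary keys."""
    return [{"id": pk, "name": f"customer-{pk}"} for pk in range(start, start + n)]


def make_children(parents, n):
    """Each child's foreign key is drawn from the existing parent keys."""
    parent_ids = [parent["id"] for parent in parents]
    return [
        {
            "id": pk,
            "customer_id": random.choice(parent_ids),
            "amount": random.randint(1, 500),
        }
        for pk in range(1, n + 1)
    ]


customers = make_parents(10)
orders = make_children(customers, 100)
print(orders[:3])
```

Uniform random choice spreads children evenly across parents; a skewed distribution may be closer to production data.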
HANDLING FORMATS
def gen_string(self, schema):
    if 'const' in schema:
        return schema['const']
    elif 'enum' in schema:
        return random.choice(schema['enum'])
    elif 'format' in schema:
        return FORMATS[schema['format']]()
    else:
        return "string"
import datetime
import random

# randrange() rejects float arguments on Python 3.12+, so truncate to int
TS_RANGE = (int(datetime.datetime(1900, 1, 1).timestamp()),
            int(datetime.datetime(2100, 12, 31).timestamp()))
FORMATS = {
    'date-time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).isoformat()
    ),
    'date': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).date().isoformat()
    ),
    'time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).time().isoformat()
    ),
    # the slide ends here; any further formats are not shown
}
DATA DOMAINS
String format (and enum) may not be enough to characterize data
Doing Text Search or Indexing? You want text-like data
Using Names or Addresses? Random strings may not be appropriate.
Credit card numbers? You want 16-digit strings
EXAMPLE DOMAIN: DIGITS
def digits(n):
    # the slide's digit string omitted '6'
    return ''.join(random.choice('0123456789') for _ in range(n))
EXAMPLE DOMAIN: NAMES
class LoremIpsum:
    _phrases = [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        # …etc.…
        "mollis eleifend leo venenatis"
    ]

    @staticmethod
    def word():
        return random.choice(random.choice(LoremIpsum._phrases).split())

    @staticmethod
    def name():
        return ' '.join(LoremIpsum.word() for _ in range(3)).title()
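The slides do not show how domain generators get selected, so the following is only a sketch of one way to wire them in. It assumes a custom `domain` keyword in the schema and a hypothetical `DOMAINS` registry; a shortened word list stands in for the Lorem Ipsum phrases.

```python
import random

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def digits(n):
    """Random string of n decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n))


def name():
    """Three random words, title-cased, as a name-like value."""
    return " ".join(random.choice(WORDS) for _ in range(3)).title()


# Hypothetical registry: domain name -> zero-argument generator.
DOMAINS = {
    "card_number": lambda: digits(16),
    "name": name,
}


def gen_string(schema, domains=DOMAINS):
    """Prefer a domain generator when the schema names one (assumed keyword)."""
    domain = schema.get("domain")
    if domain in domains:
        return domains[domain]()
    if "enum" in schema:
        return random.choice(schema["enum"])
    return "string"


print(gen_string({"type": "string", "domain": "card_number"}))
print(gen_string({"type": "string", "domain": "name"}))
```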
RECAP
HOW TO GET INTO TROUBLE
Faith
Have faith the best practices you read in a blog really work
Assume
Assume you understand best practices you read in a blog
Hope
Hope you will somehow avoid scaling problems
SOME TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
TO DO CHECKLIST
Simulate Early and Often
Define Python Classes
Use JSON Schema to provide fine-grained definitions
With ranges, formats, enums
Build a generator to populate instances in bulk
Gather Performance Data
Profit
AVOID DATABASE SURPRISES WITH EARLY SIMULATION


Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

AVOID DATABASE SURPRISES WITH EARLY SIMULATION

  • 3. WEB APP DOESN’T SCALE You’ve got a brilliant app You’ve got a brilliant cloud deployment EC2, ALB: all the right moving parts It still doesn't scale: SLOW
  • 4. ANALYTICS DON’T SCALE You’ve studied the data You’ve got a model that’s hugely important It explains things It predicts things But it’s SLOW
  • 5. YOU BLAMED PYTHON Web: NGINX, uWSGI, the proxy server, the coffee shop Analytics: Pandas, Scikit Learn, Jupyter Notebook, open office layouts And More…
  • 6. STACK OVERFLOW SAYS “PROFILE” So you profiled and you profiled And… It turns out it’s the database
  • 7. HORROR MOVIE TROPE There’s a monster And it’s in your base And it’s killing your dudes It’s Your Database
  • 8. KILLING THE MONSTER Hard work Lots of stress Many Techniques Indexing Denormalization I/O Concurrency (i.e., more devices) Compression
  • 9. CAN WE PREVENT ALL THIS? Spoiler Alert: Yes
  • 10. TO AVOID BECOMING A HORROR STORY Simulate Early Often
  • 11. A/K/A SYNTHETIC DATA Why You Don’t Necessarily Need Data for Data Science https://medium.com/capital-one-tech
  • 12. HORROR MOVIE TROPES Look behind the door Don’t run into the dark barn alone Avoid the trackless forest after dark Stay with your friends Don’t dismiss the funny noises
  • 13. DATA IS THE ONLY THING THAT MATTERS Foundational Concept
  • 14. SIDE-BAR ARGUMENT UX is important, but secondary But it’s never the kind of bottleneck app and DB servers are You will also experiment with UX I’m not saying ignore the UX Data lasts forever. Data is converted and preserved. UX is transient. Next release will have a better, more modern experience.
  • 15. SIMULATE EARLY Build your data models first Build the nucleus of your application processing Build a performance testing environment With realistic volumes of data
  • 16. SIMULATE OFTEN When your data model changes When you add features When you acquire more sources of data Rerun the nucleus of your application processing With realistic volumes of data
  • 18. WHAT YOU’LL NEED A candidate data model The nucleus of processing RESTful API CRUD elements Analytical Extract-Transform-Load-Map-Reduce steps A data generator to populate the database Measurements using time.perf_counter()
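The measurement item in the list above can be sketched with `time.perf_counter()`. This is only a minimal sketch: `timed` and `sample_workload` are hypothetical names that do not appear in the slides, and the workload is a stand-in for whatever the nucleus of your processing turns out to be.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def sample_workload(n):
    # Hypothetical stand-in for the nucleus of application processing.
    return sum(i * i for i in range(n))


results = {}
with timed("sum of squares", results):
    total = sample_workload(100_000)

for label, seconds in results.items():
    print(f"{label}: {seconds:.6f} s")
```

Collecting the timings in a dictionary makes it easy to compare runs after each data model change.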
  • 19. DATA MODEL SQLAlchemy or Django ORM (or others, SQLObject, etc.) Data Classes Plain Old Python Objects (POPO) and JSON serialization If we use JSON Schema validation, we can do cool stuff
  • 20. THE PROBLEM: FLEXIBILITY SQL provides minimal type information (string, decimal, date, etc.) No ranges, enumerated values or other domain details (e.g., name vs. address) Does provide Primary Key and Foreign Key information Data classes provide more detailed type information Still doesn’t include ranges or other domain details No PK/FK help at all
  • 21. A SOLUTION Narrow type specifications using JSON Schema Examples to follow
  • 22. JSON Schema Definition In YAML Notation:

        class Card(Model):
            """
            title: Card
            description: "Simple Playing Cards"
            type: object
            properties:
              suit:
                type: string
                enum: ["H", "S", "D", "C"]
              rank:
                type: integer
                minimum: 1
                maximum: 13
            """
  • 23. HOW DOES THIS WORK? A metaclass parses the schema YAML and builds a validator An abstract superclass provides __init__() to validate the document
  • 24. Builds JSONSchema validator from __doc__ string:

        import yaml
        import json
        import jsonschema

        class SchemaMeta(type):
            def __new__(mcs, name, bases, namespace):
                # pylint: disable=protected-access
                result = type.__new__(mcs, name, bases, dict(namespace))
                # safe_load() replaces the slide's yaml.load(), which needs an
                # explicit Loader argument in PyYAML 6 and later.
                result.SCHEMA = yaml.safe_load(result.__doc__)
                jsonschema.Draft4Validator.check_schema(result.SCHEMA)
                result._validator = jsonschema.Draft4Validator(result.SCHEMA)
                return result

  • 25. Validates object and raises TypeError:

        class Model(dict, metaclass=SchemaMeta):
            """
            title: Model
            description: abstract superclass for Model
            """
            @classmethod
            def from_json(cls, document):
                # The YAML parser is used to read the JSON text here.
                return cls(yaml.safe_load(document))

            @property
            def json(self):
                return json.dumps(self)

            def __init__(self, *args, **kw):
                super().__init__(*args, **kw)
                if not self._validator.is_valid(self):
                    raise TypeError(list(self._validator.iter_errors(self)))
  • 26. Deserialize POPO from JSON text, then serialize POPO into JSON text:

        >>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
        >>> h1['suit']
        'H'
        >>> h1.json
        '{"suit": "H", "rank": 1}'

  • 27. Fail to deserialize invalid POPO from JSON text:

        >>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "<stdin>", line 8, in from_json
          File "<stdin>", line 15, in __init__
        TypeError: [<ValidationError: "'hearts' is not one of ['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less than the minimum of 1'>]
  • 28. WHY? JSON Schema allows us to provide Type (string, number, integer, boolean, array, or object) Ranges for numerics Enumerated values (for numbers or strings) Format for strings (e.g., email, uri, date-time, etc.) Text Patterns for strings (more general regular expression handling)
  • 30. AHA With JSON schema we can build simulated data
  • 31. THERE ARE SIX SCHEMA TYPES null — Always None integer — Use const, enum, minimum, maximum constraints number — Use const, enum, minimum, maximum constraints string — Use const, enum, format, or pattern constraints There are 17 defined formats to narrow the constraints array — recursively expand items to build an array object — recursively expand properties to build a document
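The integer and number rules above might look like the following. This is a hedged sketch rather than the deck's own code: the function bodies are elided on the slides, and the fallback bounds (`DEFAULT_SPAN`, used when `minimum` or `maximum` is absent) are assumptions made only so the example runs.

```python
import random

DEFAULT_SPAN = 100  # assumed width of the range when a bound is missing


def _bounds(schema):
    """Return (lo, hi) from minimum/maximum, filling a missing side by assumption."""
    lo = schema.get("minimum")
    hi = schema.get("maximum")
    if lo is None and hi is None:
        return 0, DEFAULT_SPAN
    if lo is None:
        return hi - DEFAULT_SPAN, hi
    if hi is None:
        return lo, lo + DEFAULT_SPAN
    return lo, hi


def gen_integer(schema):
    """Pick an integer honoring const, enum, minimum, and maximum (integer bounds assumed)."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.randint(lo, hi)  # inclusive on both ends


def gen_number(schema):
    """Pick a float honoring const, enum, minimum, and maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.uniform(lo, hi)


rank_schema = {"type": "integer", "minimum": 1, "maximum": 13}
ranks = [gen_integer(rank_schema) for _ in range(5)]
print(ranks)
```

Uniform sampling is the simplest choice; a skewed distribution may be closer to real data if you know something about it.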
  • 32. Finds gen_* methods:

        class Generator:
            def __init__(self, parent_schema, domains=None):
                self.schema = parent_schema

            def gen_null(self, schema):
                return None

            def gen_string(self, schema): …

            def gen_integer(self, schema): …

            def gen_number(self, schema): …

            def gen_array(self, schema):
                # lo and hi are not defined on the slide; they are
                # presumably the bounds on the array length.
                doc = [self.generate(schema.get('items')) for _ in range(lo, hi+1)]
                return doc

            def gen_object(self, schema):
                doc = {
                    name: self.generate(subschema)
                    for name, subschema in schema.get('properties', {}).items()
                }
                return doc

            def generate(self, schema=None):
                schema = schema or self.schema
                schema_type = schema.get('type', 'object')
                # Dispatch on the schema type: "integer" -> gen_integer, etc.
                method = getattr(self, f"gen_{schema_type}")
                return method(schema)

  • 33. Print the documents. Or write to a file. Or load a database:

        def make_documents(model_class, count=100, domains=None):
            generator = Generator(model_class.SCHEMA, domains)
            docs_iter = (generator.generate() for i in range(count))
            for doc in docs_iter:
                print(model_class(**doc))
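The "load a database" option can be sketched with the standard library's `sqlite3` module, timing the bulk load along the way. Everything here is an assumption for illustration: `fake_cards` stands in for the generator's output, and the one-column `card` table is a hypothetical layout, not the deck's schema.

```python
import json
import random
import sqlite3
import time


def fake_cards(count):
    """Stand-in for generated documents that match the Card schema."""
    for _ in range(count):
        yield {"suit": random.choice("HSDC"), "rank": random.randint(1, 13)}


def load_documents(connection, docs):
    """Insert each document as one row of JSON text and return the row count."""
    connection.execute("CREATE TABLE IF NOT EXISTS card (doc TEXT NOT NULL)")
    rows = [(json.dumps(doc),) for doc in docs]
    with connection:  # commits on success, rolls back on an exception
        connection.executemany("INSERT INTO card (doc) VALUES (?)", rows)
    return len(rows)


connection = sqlite3.connect(":memory:")
start = time.perf_counter()
loaded = load_documents(connection, fake_cards(10_000))
elapsed = time.perf_counter() - start

row_count = connection.execute("SELECT COUNT(*) FROM card").fetchone()[0]
print(f"loaded {loaded} rows in {elapsed:.4f} s")
```

An in-memory database keeps the sketch self-contained; a realistic test would point at the same kind of database server the application will use.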
  • 34. NOW YOU CAN SIMULATE Early Often
  • 35. WHAT ABOUT? More sophisticated data domains? Name, Address, Comments, etc. More than text. No simple format. Primary Key and Foreign Key Relationships
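One possible way to handle the primary key and foreign key question is to generate the parent documents first and then draw each child's foreign key from the keys already issued. The slides do not show how the author does this; the `customer` and `order` shapes, the field names, and both helper functions below are hypothetical.

```python
import random
import uuid


def make_parents(count):
    """Generate parent documents, each with a unique primary key."""
    return [
        {"id": str(uuid.uuid4()), "name": f"customer-{n}"}
        for n in range(count)
    ]


def make_children(parents, count):
    """Generate child documents whose foreign key always names an existing parent."""
    parent_keys = [parent["id"] for parent in parents]
    return [
        {
            "id": str(uuid.uuid4()),
            "customer_id": random.choice(parent_keys),  # FK drawn from real PKs
            "amount": random.randint(1, 500),
        }
        for _ in range(count)
    ]


customers = make_parents(10)
orders = make_children(customers, 50)
print(orders[0])
```

Choosing the foreign key uniformly is the simplest option; weighting the choice would let a few parents own most of the children, which is often what stresses an index.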
  • 36. HANDLING FORMATS

        def gen_string(self, schema):
            if 'const' in schema:
                return schema['const']
            elif 'enum' in schema:
                return random.choice(schema['enum'])
            elif 'format' in schema:
                return FORMATS[schema['format']]()
            else:
                return "string"

  • 37.
        # int() because recent Python versions reject float arguments
        # to random.randrange().
        TS_RANGE = (
            int(datetime.datetime(1900, 1, 1).timestamp()),
            int(datetime.datetime(2100, 12, 31).timestamp()),
        )

        FORMATS = {
            'date-time': (
                lambda: datetime.datetime.utcfromtimestamp(
                    random.randrange(*TS_RANGE)
                ).isoformat()
            ),
            'date': (
                lambda: datetime.datetime.utcfromtimestamp(
                    random.randrange(*TS_RANGE)
                ).date().isoformat()
            ),
            'time': (
                lambda: datetime.datetime.utcfromtimestamp(
                    random.randrange(*TS_RANGE)
                ).time().isoformat()
            ),
            # … the slide ends here; the remaining formats are not shown
        }
  • 38. DATA DOMAINS String format (and enum) may not be enough to characterize data Doing Text Search or Indexing? You want text-like data Using Names or Addresses? Random strings may not be appropriate. Credit card numbers? You want 16-digit strings
  • 39. EXAMPLE DOMAIN: DIGITS

        def digits(n):
            # All ten digits; the slide's string omitted '6'.
            return ''.join(random.choice('0123456789') for _ in range(n))

  • 40. EXAMPLE DOMAIN: NAMES

        class LoremIpsum:
            _phrases = [
                "Lorem ipsum dolor sit amet",
                "consectetur adipiscing elit",
                # …etc.…
                "mollis eleifend leo venenatis"
            ]

            @staticmethod
            def word():
                # Pick a random phrase, then a random word within it.
                return random.choice(random.choice(LoremIpsum._phrases).split())

            @staticmethod
            def name():
                return ' '.join(LoremIpsum.word() for _ in range(3)).title()
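The slides pass a `domains` argument to the generator but do not show how it is consulted. One way it could work, offered purely as an assumption, is a non-standard `domain` keyword in the schema that names an entry in a lookup table of generator functions. The `DOMAINS` table, the `domain` keyword, and the `credit-card` entry are all hypothetical.

```python
import random


def digits(n):
    """Return a string of n random decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n))


# Hypothetical registry mapping a domain name to a zero-argument generator.
DOMAINS = {
    "credit-card": lambda: digits(16),
}


def gen_string(schema, domains=DOMAINS):
    """Generate a string, preferring const, then enum, then a named domain."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    domain = schema.get("domain")  # assumed, non-standard schema keyword
    if domain in domains:
        return domains[domain]()
    return "string"


card_number = gen_string({"type": "string", "domain": "credit-card"})
print(card_number)
```

Keeping the domain functions in a plain dictionary means new domains, such as names built from lorem ipsum words, can be added without touching the generator.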
  • 41. RECAP
  • 42. HOW TO GET INTO TROUBLE Faith Have faith the best practices you read in a blog really work Assume Assume you understand best practices you read in a blog Hope Hope you will somehow avoid scaling problems
  • 43. SOME TROPES Look behind the door Don’t run into the dark barn alone Avoid the trackless forest after dark Stay with your friends Don’t dismiss the funny noises
  • 44. TO DO CHECKLIST Simulate Early and Often Define Python Classes Use JSON Schema to provide fine-grained definitions With ranges, formats, enums Build a generator to populate instances in bulk Gather Performance Data Profit