AVOID DATABASE SURPRISES WITH EARLY SIMULATION

AVOIDING BAD DATABASE SURPRISES
Simulation and Scalability

WEB APP DOESN’T SCALE
You’ve got a brilliant app
You’ve got a brilliant cloud deployment
EC2, ALB — all the right moving parts
It still doesn't scale
SLOW

ANALYTICS DON’T SCALE
You’ve studied the data
You’ve got a model that’s hugely important
It explains things
It predicts things
But it’s SLOW

YOU BLAMED PYTHON
Web:
NGINX,
uWSGI,
the proxy server,
the coffee shop
Analytics:
Pandas,
Scikit Learn,
Jupyter Notebook,
open office layouts
And More…

STACK OVERFLOW SAYS “PROFILE”
So you profiled
and you profiled
And…
It turns out it’s the database

HORROR MOVIE TROPE
There’s a monster
And it’s in your base
And it’s killing your dudes
It’s Your Database

KILLING THE MONSTER
Hard work
Lots of stress
Many Techniques
Indexing
Denormalization
I/O Concurrency (i.e., more devices)
Compression

CAN WE PREVENT ALL THIS?
Spoiler Alert: Yes

TO AVOID BECOMING A HORROR STORY
Simulate
Early
Often

A/K/A SYNTHETIC DATA
Why You Don’t Necessarily Need Data for Data Science
https://medium.com/capital-one-tech

HORROR MOVIE TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises

DATA IS THE ONLY THING
THAT MATTERS
Foundational Concept

SIDE-BAR ARGUMENT
UX is important, but secondary
It’s never the kind of bottleneck that app and DB servers are
You will also experiment with UX
I’m not saying ignore the UX
Data lasts forever. Data is converted and preserved.
UX is transient. Next release will have a better, more modern experience.

SIMULATE EARLY
Build your data models first
Build the nucleus of your application processing
Build a performance testing environment
With realistic volumes of data

SIMULATE OFTEN
When your data model changes
When you add features
When you acquire more sources of data
Rerun the nucleus of your application processing
With realistic volumes of data

WHAT YOU’LL NEED
A candidate data model
The nucleus of processing
RESTful API CRUD elements
Analytical Extract-Transform-Load-Map-Reduce steps
A data generator to populate the database
Measurements using time.perf_counter()
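A minimal sketch of the measurement piece, assuming nothing beyond the standard library. The `timed` helper and the `results` dictionary are hypothetical names for illustration, not part of the talk; the workload inside the `with` block stands in for the real nucleus of processing.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Append the elapsed wall-clock seconds of the enclosed block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.setdefault(label, []).append(time.perf_counter() - start)

results = {}
for _ in range(3):
    # Placeholder workload: replace with the real processing under test.
    with timed("build_rows", results):
        rows = [{"id": i, "name": f"name-{i}"} for i in range(10_000)]

for label, samples in results.items():
    print(f"{label}: best={min(samples):.6f}s over {len(samples)} runs")
```

Repeating the run and reporting the best time reduces noise from other activity on the machine.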

DATA MODEL
SQLAlchemy or Django ORM (or others, SQLObject, etc.)
Data Classes
Plain Old Python Objects (POPO) and JSON serialization
If we use JSON Schema validation, we can do cool stuff

THE PROBLEM: FLEXIBILITY
SQL provides minimal type information (string, decimal, date, etc.)
No ranges, enumerated values or other domain details (e.g., name vs. address)
Does provide Primary Key and Foreign Key information
Data classes provide more detailed type information
Still doesn’t include ranges or other domain details
No PK/FK help at all
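To illustrate the data class limitation, here is a small sketch. The `Card` data class below is an assumption made for this example (it mirrors the playing-card model used later), not code from the talk. The annotations name the types, but nothing stops an out-of-range value.

```python
from dataclasses import dataclass, fields

@dataclass
class Card:
    suit: str   # no way to say "one of H, S, D, C" here
    rank: int   # no way to say "between 1 and 13" here

# The data class accepts values outside the intended domain without complaint.
bad = Card(suit="hearts", rank=-12)
print(bad)

# The only metadata available is the field name and the annotated type.
print([f.name for f in fields(Card)])
```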

A SOLUTION
Narrow type specifications using JSON Schema
Examples to follow

class Card(Model):
    """
    title: Card
    description: "Simple Playing Cards"
    type: object
    properties:
      suit:
        type: string
        enum: ["H", "S", "D", "C"]
      rank:
        type: integer
        minimum: 1
        maximum: 13
    """
JSON Schema Definition
In YAML Notation

HOW DOES THIS WORK?
A metaclass parses the schema YAML and builds a validator
An abstract superclass provides __init__() to validate the document

import json
import jsonschema
import yaml
class SchemaMeta(type):
    def __new__(mcs, name, bases, namespace):
        # pylint: disable=protected-access
        result = type.__new__(mcs, name, bases, dict(namespace))
        # safe_load: yaml.load() without an explicit Loader fails on PyYAML 6+
        result.SCHEMA = yaml.safe_load(result.__doc__)
        # Reject a malformed schema at class-definition time
        jsonschema.Draft4Validator.check_schema(result.SCHEMA)
        result._validator = jsonschema.Draft4Validator(result.SCHEMA)
        return result
Builds JSONSchema validator
from __doc__ string

class Model(dict, metaclass=SchemaMeta):
    """
    title: Model
    description: abstract superclass for Model
    """
    @classmethod
    def from_json(cls, document):
        # JSON text parses as YAML; safe_load avoids arbitrary object construction
        return cls(yaml.safe_load(document))
    @property
    def json(self):
        return json.dumps(self)
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        # Validate against the schema built by SchemaMeta
        if not self._validator.is_valid(self):
            raise TypeError(list(self._validator.iter_errors(self)))
Validates object and raises TypeError

>>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
>>> h1['suit']
'H'
>>> h1.json
'{"suit": "H", "rank": 1}'
Deserialize POPO from JSON text
Serialize POPO into JSON text

>>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in from_json
  File "<stdin>", line 15, in __init__
TypeError: [<ValidationError: "'hearts' is not one of
['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less
than the minimum of 1'>]
Fail to deserialize invalid POPO from JSON text

WHY?
JSON Schema allows us to provide
Type (string, number, integer, boolean, array, or object)
Ranges for numerics
Enumerated values (for numbers or strings)
Format for strings (e.g., email, uri, date-time)
Text Patterns for strings (more general regular expression handling)

AHA
With JSON schema we can build simulated data

SIX SCHEMA TYPES TO GENERATE
null — Always None
integer — Use const, enum, minimum, maximum constraints
number — Use const, enum, minimum, maximum constraints
string — Use const, enum, format, or pattern constraints
There are 17 defined formats to narrow the constraints
array — recursively expand items to build an array
object — recursively expand properties to build a document
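The integer and number rules above might be sketched as follows. This is one possible reading, not the talk's implementation: the helper `_bounds`, the `rng` parameter, and the default span of 100 when a bound is missing are all assumptions made for the example.

```python
import math
import random

def _bounds(schema, span=100):
    """Return (lo, hi) from minimum/maximum; the fallback span is an assumption."""
    lo = schema.get("minimum")
    hi = schema.get("maximum")
    if lo is None:
        lo = 0 if hi is None else hi - span
    if hi is None:
        hi = lo + span
    return lo, hi

def gen_integer(schema, rng=random):
    """Pick an integer honoring const, then enum, then minimum/maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return rng.choice(schema["enum"])
    lo, hi = _bounds(schema)
    # Round inward so a fractional bound is never violated.
    return rng.randint(math.ceil(lo), math.floor(hi))

def gen_number(schema, rng=random):
    """Pick a float honoring const, then enum, then minimum/maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return rng.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return rng.uniform(lo, hi)

rank_schema = {"type": "integer", "minimum": 1, "maximum": 13}
print([gen_integer(rank_schema) for _ in range(5)])
```

Checking const before enum before ranges means the narrowest constraint wins, which matches the order of the list above.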

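import random
class Generator:
    def __init__(self, parent_schema, domains=None):
        self.schema = parent_schema
        self.domains = domains or {}
    def gen_null(self, schema):
        return None
    def gen_string(self, schema): ...
    def gen_integer(self, schema): ...
    def gen_number(self, schema): ...
    def gen_array(self, schema):
        # Assumption: the slide leaves lo and hi undefined. Here they come from
        # minItems/maxItems, and the array length is chosen at random between them.
        lo = schema.get('minItems', 0)
        hi = schema.get('maxItems', lo + 5)
        size = random.randint(lo, hi)
        doc = [self.generate(schema.get('items')) for _ in range(size)]
        return doc
    def gen_object(self, schema):
        doc = {
            name: self.generate(subschema)
            for name, subschema in schema.get('properties', {}).items()
        }
        return doc
    def generate(self, schema=None):
        schema = schema or self.schema
        schema_type = schema.get('type', 'object')
        # Dispatch on the schema type name: "string" -> gen_string, etc.
        method = getattr(self, f"gen_{schema_type}")
        return method(schema)
Finds gen_* methods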

def make_documents(model_class, count=100, domains=None):
    generator = Generator(model_class.SCHEMA, domains)
    docs_iter = (generator.generate() for i in range(count))
    for doc in docs_iter:
        print(model_class(**doc))
Or write to a file
Or load a database
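Loading a database and measuring it could look like the sketch below. It uses an in-memory SQLite database purely as a stand-in; the table layout, the `make_cards` and `timed_query` helpers, and the row count are assumptions for illustration, and a real test would target the production database engine with realistic volumes.

```python
import random
import sqlite3
import time

def make_cards(count, seed=1):
    """Generate (suit, rank) rows; a stand-in for the schema-driven generator."""
    rng = random.Random(seed)
    return [(rng.choice("HSDC"), rng.randint(1, 13)) for _ in range(count)]

def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE card (id INTEGER PRIMARY KEY, suit TEXT, rank INTEGER)")
conn.executemany("INSERT INTO card (suit, rank) VALUES (?, ?)", make_cards(50_000))
conn.commit()

sql = "SELECT COUNT(*) FROM card WHERE suit = ? AND rank = ?"
before, t_before = timed_query(conn, sql, ("H", 1))

# Try one of the classic fixes (indexing) and measure again.
conn.execute("CREATE INDEX card_suit_rank ON card (suit, rank)")
after, t_after = timed_query(conn, sql, ("H", 1))

print(f"no index: {t_before:.6f}s, with index: {t_after:.6f}s, matched: {after[0][0]}")
```

The point is the workflow, not the numbers: populate, measure, change the design, measure again.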

NOW YOU CAN SIMULATE
Early Often

WHAT ABOUT?
More sophisticated data domains?
Name, Address, Comments, etc.
More than text. No simple format.
Primary Key and Foreign Key Relationships
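One way to handle primary and foreign keys is to generate parent rows first, then have each child row draw its foreign key from the parent keys that already exist. The sketch below assumes hypothetical customer and order documents; none of these names come from the talk.

```python
import itertools
import random

def make_related(parent_count, child_count, seed=42):
    """Build parents with sequential PKs, then children whose FK points at a real parent."""
    rng = random.Random(seed)
    customer_pk = itertools.count(1)
    customers = [
        {"id": next(customer_pk), "name": f"customer-{n}"}
        for n in range(parent_count)
    ]
    parent_ids = [c["id"] for c in customers]
    order_pk = itertools.count(1)
    orders = [
        {
            "id": next(order_pk),
            # Foreign key: always chosen from keys that were actually generated.
            "customer_id": rng.choice(parent_ids),
            "total": round(rng.uniform(1, 500), 2),
        }
        for _ in range(child_count)
    ]
    return customers, orders

customers, orders = make_related(parent_count=10, child_count=50)
print(len(customers), len(orders), orders[0]["customer_id"] in {c["id"] for c in customers})
```

A uniform choice spreads children evenly across parents; a skewed distribution may be closer to real data and is worth simulating too.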

HANDLING FORMATS
def gen_string(self, schema):
    if 'const' in schema:
        return schema['const']
    elif 'enum' in schema:
        return random.choice(schema['enum'])
    elif 'format' in schema:
        return FORMATS[schema['format']]()
    else:
        return "string"

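import datetime
import random
# Whole seconds: random.randrange() rejects float arguments on Python 3.12+
TS_RANGE = (int(datetime.datetime(1900, 1, 1).timestamp()),
            int(datetime.datetime(2100, 12, 31).timestamp()))
FORMATS = {
    'date-time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).isoformat()
    ),
    'date': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).date().isoformat()
    ),
    'time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).time().isoformat()
    ),
}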

DATA DOMAINS
String format (and enum) may not be enough to characterize data
Doing Text Search or Indexing? You want text-like data
Using Names or Addresses? Random strings may not be appropriate.
Credit card numbers? You want 16-digit strings

EXAMPLE DOMAIN: DIGITS
import random
def digits(n):
    return ''.join(random.choice('0123456789') for _ in range(n))

EXAMPLE DOMAIN: NAMES
class LoremIpsum:
    _phrases = [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        # …etc.…
        "mollis eleifend leo venenatis"
    ]
    @staticmethod
    def word():
        # Pick a random phrase, then a random word from it
        return random.choice(random.choice(LoremIpsum._phrases).split())
    @staticmethod
    def name():
        return ' '.join(LoremIpsum.word() for _ in range(3)).title()
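The slides do not show how a domain function reaches the generator. One possible wiring, sketched here under stated assumptions: a custom `domain` keyword in the schema (hypothetical, not part of JSON Schema) names an entry in a `DOMAINS` mapping (also hypothetical), and the string generator consults it before falling back to a placeholder.

```python
import random

def digits(n):
    """Return a string of n random decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n))

# Hypothetical mapping from domain name to a zero-argument generator function.
DOMAINS = {
    "credit_card": lambda: digits(16),
    "zip_code": lambda: digits(5),
}

def gen_string(schema, domains=DOMAINS):
    """Generate a string from const, enum, or an assumed custom 'domain' keyword."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    domain = schema.get("domain")  # assumption: an extension keyword, not standard
    if domain in domains:
        return domains[domain]()
    return "string"

card_number_schema = {"type": "string", "domain": "credit_card"}
print(gen_string(card_number_schema))
```

Keeping the domain functions in a mapping means new domains can be added without touching the generator itself.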

HOW TO GET INTO TROUBLE
Faith
Have faith the best practices you read in a blog really work
Assume
Assume you understand best practices you read in a blog
Hope
Hope you will somehow avoid scaling problems

SOME TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises

TO DO CHECKLIST
Simulate Early and Often
Define Python Classes
Use JSON Schema to provide fine-grained definitions
With ranges, formats, enums
Build a generator to populate instances in bulk
Gather Performance Data
Profit
