Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott

There are many stories of developers creating databases that don't operate at scale. The application is good, but the database won't work with realistic volumes of data. It's like a horror movie where they never looked behind the door, ran into the dark forest at night, and discovered the database was the monster killing their application. How can we leverage Python to avoid scaling problems?

  1. 1. AVOIDING BAD DATABASE SURPRISES Simulation and Scalability
  3. 3. WEB APP DOESN’T SCALE: You’ve got a brilliant app. You’ve got a brilliant cloud deployment: EC2, ALB, all the right moving parts. It still doesn't scale. SLOW
  4. 4. ANALYTICS DON’T SCALE: You’ve studied the data. You’ve got a model that’s hugely important. It explains things. It predicts things. But it’s SLOW.
  5. 5. YOU BLAMED PYTHON: Web: NGINX, uWSGI, the proxy server, the coffee shop. Analytics: Pandas, scikit-learn, Jupyter Notebook, open office layouts. And more…
  6. 6. STACK OVERFLOW SAYS “PROFILE”: So you profiled and you profiled. And… it turns out it’s the database.
  7. 7. HORROR MOVIE TROPE: There’s a monster. And it’s in your base. And it’s killing your dudes. It’s your database.
  8. 8. KILLING THE MONSTER: Hard work. Lots of stress. Many techniques: indexing, denormalization, I/O concurrency (i.e., more devices), compression.
  9. 9. CAN WE PREVENT ALL THIS? Spoiler Alert: Yes
  10. 10. TO AVOID BECOMING A HORROR STORY Simulate Early Often
  11. 11. A/K/A SYNTHETIC DATA Why You Don’t Necessarily Need Data for Data Science
  12. 12. HORROR MOVIE TROPES: Look behind the door. Don’t run into the dark barn alone. Avoid the trackless forest after dark. Stay with your friends. Don’t dismiss the funny noises.
  13. 13. DATA IS THE ONLY THING THAT MATTERS Foundational Concept
  14. 14. SIDE-BAR ARGUMENT: UX is important, but secondary: it’s never the kind of bottleneck that app and DB servers are. You will also experiment with UX. I’m not saying ignore the UX. Data lasts forever: data is converted and preserved. UX is transient: the next release will have a better, more modern experience.
  15. 15. SIMULATE EARLY: Build your data models first. Build the nucleus of your application processing. Build a performance testing environment with realistic volumes of data.
  16. 16. SIMULATE OFTEN: When your data model changes, when you add features, or when you acquire more sources of data, rerun the nucleus of your application processing with realistic volumes of data.
  18. 18. WHAT YOU’LL NEED: A candidate data model. The nucleus of processing: RESTful API CRUD elements, or analytical Extract-Transform-Load-Map-Reduce steps. A data generator to populate the database. Measurements using time.perf_counter().
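As a minimal sketch of the measurement step, the snippet below wraps `time.perf_counter()` in a small context manager. The `timed` helper, the labels, and the toy workload are illustrative assumptions, not code from the talk; in practice the timed block would be the nucleus of the application processing.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}

# Toy workload standing in for "populate the database" (hypothetical).
with timed("build 100k rows", results):
    rows = [{"id": i, "value": i * 2} for i in range(100_000)]

# Toy workload standing in for "run the nucleus of processing" (hypothetical).
with timed("scan for one row", results):
    match = next(row for row in rows if row["id"] == 99_999)

for label, seconds in results.items():
    print(f"{label}: {seconds:.4f}s")
```

Collecting the timings in a dictionary makes it easy to compare runs after each data-model change.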
  19. 19. DATA MODEL: SQLAlchemy or Django ORM (or others: SQLObject, etc.). Data classes. Plain Old Python Objects (POPO) and JSON serialization. If we use JSON Schema validation, we can do cool stuff.
  20. 20. THE PROBLEM: FLEXIBILITY. SQL provides minimal type information (string, decimal, date, etc.), with no ranges, enumerated values, or other domain details (e.g., name vs. address); it does provide Primary Key and Foreign Key information. Data classes provide more detailed type information, but still don’t include ranges or other domain details, and offer no PK/FK help at all.
  21. 21. A SOLUTION: Narrow type specifications using JSON Schema. Examples to follow.
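To illustrate the gap described above, here is a small sketch using the standard-library `dataclasses` module. The `Card` fields mirror the playing-card example used later in the deck; the point is only that type annotations are recorded but not enforced, and there is nowhere to state a range or an enumeration.

```python
from dataclasses import dataclass


@dataclass
class Card:
    suit: str   # no way to say "one of H, S, D, C"
    rank: int   # no way to say "between 1 and 13"


# Nothing stops an out-of-domain value from being stored.
bad = Card(suit="hearts", rank=-12)
print(bad)  # Card(suit='hearts', rank=-12)
```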
  22. 22. JSON Schema Definition In YAML Notation

      class Card(Model):
          """
          title: Card
          description: "Simple Playing Cards"
          type: object
          properties:
            suit:
              type: string
              enum: ["H", "S", "D", "C"]
            rank:
              type: integer
              minimum: 1
              maximum: 13
          """

  23. 23. HOW DOES THIS WORK? A metaclass parses the schema YAML and builds a validator. An abstract superclass provides __init__() to validate the document.
  24. 24. Builds JSONSchema validator from __doc__ string

      import yaml
      import json
      import jsonschema

      class SchemaMeta(type):
          def __new__(mcs, name, bases, namespace):
              # pylint: disable=protected-access
              result = type.__new__(mcs, name, bases, dict(namespace))
              # safe_load() only builds plain Python objects; a bare
              # yaml.load() without a Loader argument fails in PyYAML 6.
              result.SCHEMA = yaml.safe_load(result.__doc__)
              jsonschema.Draft4Validator.check_schema(result.SCHEMA)
              result._validator = jsonschema.Draft4Validator(result.SCHEMA)
              return result

  25. 25. Validates object and raises TypeError

      class Model(dict, metaclass=SchemaMeta):
          """
          title: Model
          description: abstract superclass for Model
          """
          @classmethod
          def from_json(cls, document):
              # Typical JSON text is also valid YAML, so the YAML parser reads it.
              return cls(yaml.safe_load(document))

          @property
          def json(self):
              return json.dumps(self)

          def __init__(self, *args, **kw):
              super().__init__(*args, **kw)
              if not self._validator.is_valid(self):
                  raise TypeError(list(self._validator.iter_errors(self)))
  26. 26. Deserialize POPO from JSON text; serialize POPO into JSON text

      >>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
      >>> h1['suit']
      'H'
      >>> h1.json
      '{"suit": "H", "rank": 1}'

  27. 27. Fail to deserialize invalid POPO from JSON text

      >>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<stdin>", line 8, in from_json
        File "<stdin>", line 15, in __init__
      TypeError: [<ValidationError: "'hearts' is not one of ['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less than the minimum of 1'>]

  28. 28. WHY? JSON Schema allows us to provide: type (string, number, integer, boolean, array, or object); ranges for numerics; enumerated values (for numbers or strings); format for strings (e.g., email, uri, date-time, etc.); text patterns for strings (more general regular expression handling).
  30. 30. AHA With JSON schema we can build simulated data
  31. 31. THERE ARE SIX SCHEMA TYPES: null (always None); integer (use const, enum, minimum, maximum constraints); number (use const, enum, minimum, maximum constraints); string (use const, enum, format, or pattern constraints; there are 17 defined formats to narrow the constraints); array (recursively expand items to build an array); object (recursively expand properties to build a document). JSON Schema also defines boolean, which is not covered here.
  32. 32. Finds gen_* methods

      class Generator:
          def __init__(self, parent_schema, domains=None):
              self.schema = parent_schema

          def gen_null(self, schema):
              return None

          def gen_string(self, schema): …
          def gen_integer(self, schema): …
          def gen_number(self, schema): …

          def gen_array(self, schema):
              # lo and hi are the array-size bounds; the slide does not show
              # how they are derived from the schema.
              doc = [self.generate(schema.get('items')) for _ in range(lo, hi+1)]
              return doc

          def gen_object(self, schema):
              doc = {
                  name: self.generate(subschema)
                  for name, subschema in schema.get('properties', {}).items()
              }
              return doc

          def generate(self, schema=None):
              schema = schema or self.schema
              schema_type = schema.get('type', 'object')
              # Dispatch on the schema type: "integer" -> self.gen_integer, etc.
              method = getattr(self, f"gen_{schema_type}")
              return method(schema)
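The slide elides the bodies of the numeric generators and the array bounds. One plausible shape, written here as standalone functions rather than methods, follows the constraints listed for each type. The fallback bounds when `minimum`, `maximum`, `minItems`, or `maxItems` are absent are assumptions chosen for illustration, not values from the talk.

```python
import random


def gen_integer(schema):
    """Pick an integer honoring const, enum, minimum, and maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo = schema.get("minimum", 0)          # assumed default
    hi = schema.get("maximum", lo + 100)   # assumed default
    return random.randint(lo, hi)          # inclusive on both ends


def gen_number(schema):
    """Pick a float honoring const, enum, minimum, and maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo = schema.get("minimum", 0.0)        # assumed default
    hi = schema.get("maximum", lo + 1.0)   # assumed default
    return random.uniform(lo, hi)


def gen_array(schema, gen_item):
    """Build a list whose length falls between minItems and maxItems."""
    lo = schema.get("minItems", 0)         # assumed default
    hi = schema.get("maxItems", lo + 5)    # assumed default
    size = random.randint(lo, hi)
    return [gen_item(schema.get("items", {})) for _ in range(size)]


rank_schema = {"type": "integer", "minimum": 1, "maximum": 13}
hand_schema = {"type": "array", "items": rank_schema, "minItems": 2, "maxItems": 4}
print(gen_integer(rank_schema), gen_array(hand_schema, gen_integer))
```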
  33. 33. Or write to a file, or load a database

      def make_documents(model_class, count=100, domains=None):
          generator = Generator(model_class.SCHEMA, domains)
          docs_iter = (generator.generate() for i in range(count))
          for doc in docs_iter:
              print(model_class(**doc))
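A sketch of the "load a database" option, using the standard-library `sqlite3` module with an in-memory database. The `make_cards` function is a stand-in for the schema-driven generator, and the table name, index name, and row count are all illustrative assumptions. The same query is timed before and after adding an index, which is the kind of comparison simulated data makes possible.

```python
import random
import sqlite3
import time


def make_cards(count):
    """Stand-in for the schema-driven generator: random suit/rank documents."""
    return [
        {"suit": random.choice("HSDC"), "rank": random.randint(1, 13)}
        for _ in range(count)
    ]


def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE card (id INTEGER PRIMARY KEY, suit TEXT, rank INTEGER)")
conn.executemany(
    "INSERT INTO card (suit, rank) VALUES (:suit, :rank)", make_cards(100_000)
)
conn.commit()

query = "SELECT COUNT(*) FROM card WHERE suit = ? AND rank = ?"
before, t_scan = timed_query(conn, query, ("H", 1))
conn.execute("CREATE INDEX card_suit_rank ON card (suit, rank)")
after, t_index = timed_query(conn, query, ("H", 1))
print(f"without index: {t_scan:.4f}s, with index: {t_index:.4f}s")
```

The absolute numbers depend on the machine and the database engine; what matters is rerunning the measurement whenever the model or the volume changes.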
  34. 34. NOW YOU CAN SIMULATE Early Often
  35. 35. WHAT ABOUT? More sophisticated data domains: name, address, comments, etc. More than text; no simple format. Primary Key and Foreign Key relationships.
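One way to keep simulated foreign keys consistent is to generate the parent rows first and draw each child's key from the parents that actually exist. This is a sketch under that assumption; the `customers` and `orders` names and fields are hypothetical and do not come from the talk.

```python
import random


def gen_parents(count):
    """Parents get sequential surrogate keys, like an autoincrement column."""
    return [{"id": pk, "name": f"customer-{pk}"} for pk in range(1, count + 1)]


def gen_children(parents, count):
    """Each child draws its foreign key from keys that really exist."""
    parent_ids = [parent["id"] for parent in parents]
    return [
        {
            "id": pk,
            "customer_id": random.choice(parent_ids),
            "total": round(random.uniform(1, 500), 2),
        }
        for pk in range(1, count + 1)
    ]


customers = gen_parents(1_000)
orders = gen_children(customers, 10_000)
print(orders[0])
```

A uniform `random.choice` spreads children evenly across parents; a skewed distribution may be closer to real data, and that choice can affect index and join performance.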
  36. 36. HANDLING FORMATS

      def gen_string(self, schema):
          if 'const' in schema:
              return schema['const']
          elif 'enum' in schema:
              return random.choice(schema['enum'])
          elif 'format' in schema:
              # Look up a zero-argument factory for this format name.
              return FORMATS[schema['format']]()
          else:
              return "string"
  37. 37.

      import datetime
      import random

      # random.randrange() needs integer bounds, so truncate the float timestamps.
      TS_RANGE = (
          int(datetime.datetime(1900, 1, 1).timestamp()),
          int(datetime.datetime(2100, 12, 31).timestamp()),
      )

      FORMATS = {
          'date-time': (
              lambda: datetime.datetime.utcfromtimestamp(
                  random.randrange(*TS_RANGE)
              ).isoformat()
          ),
          'date': (
              lambda: datetime.datetime.utcfromtimestamp(
                  random.randrange(*TS_RANGE)
              ).date().isoformat()
          ),
          'time': (
              lambda: datetime.datetime.utcfromtimestamp(
                  random.randrange(*TS_RANGE)
              ).time().isoformat()
          ),
      }
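The timestamp-based approach on this slide depends on platform behavior: converting a pre-1970 date with `.timestamp()` may fail on some platforms, and `datetime.utcfromtimestamp()` is deprecated as of Python 3.12. A possible alternative, sketched here with `timedelta` arithmetic, avoids both. The names `random_datetime` and `PORTABLE_FORMATS` are illustrative, not from the talk.

```python
import datetime
import random

START = datetime.datetime(1900, 1, 1)
END = datetime.datetime(2100, 12, 31)
SPAN_SECONDS = int((END - START).total_seconds())


def random_datetime():
    """Pick a naive datetime uniformly between START and END (END excluded)."""
    return START + datetime.timedelta(seconds=random.randrange(SPAN_SECONDS))


# Same three format names as the slide, built on random_datetime().
PORTABLE_FORMATS = {
    "date-time": lambda: random_datetime().isoformat(),
    "date": lambda: random_datetime().date().isoformat(),
    "time": lambda: random_datetime().time().isoformat(),
}

print(PORTABLE_FORMATS["date-time"](), PORTABLE_FORMATS["date"]())
```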
  38. 38. DATA DOMAINS: String format (and enum) may not be enough to characterize data. Doing text search or indexing? You want text-like data. Using names or addresses? Random strings may not be appropriate. Credit card numbers? You want 16-digit strings.
  39. 39. EXAMPLE DOMAIN: DIGITS

      def digits(n):
          # All ten digits, so every n-digit string is possible.
          return ''.join(random.choice('0123456789') for _ in range(n))
  40. 40. EXAMPLE DOMAIN: NAMES

      class LoremIpsum:
          _phrases = [
              "Lorem ipsum dolor sit amet",
              "consectetur adipiscing elit",
              …etc.…
              "mollis eleifend leo venenatis"
          ]

          @staticmethod
          def word():
              # Pick a random phrase, then a random word from it.
              return random.choice(random.choice(LoremIpsum._phrases).split())

          @staticmethod
          def name():
              return ' '.join(LoremIpsum.word() for _ in range(3)).title()
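The deck does not show how a domain is selected for a given property. One possible wiring, sketched below, is a lookup table of domain generators keyed by a custom schema keyword. The `domain` keyword and the `DOMAINS` table are assumptions made for this illustration; they are not part of JSON Schema or of the talk's code.

```python
import random


def digits(n):
    """Return a string of n random decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n))


# Hypothetical registry mapping a domain name to a zero-argument factory.
DOMAINS = {
    "credit-card": lambda: digits(16),
    "zip-code": lambda: digits(5),
}


def gen_string(schema, domains=DOMAINS):
    """Generate a string from const, enum, or a (hypothetical) domain keyword."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    if "domain" in schema:  # assumed custom keyword, not standard JSON Schema
        return domains[schema["domain"]]()
    return "string"


card_number = gen_string({"type": "string", "domain": "credit-card"})
print(card_number)
```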
  41. 41. RECAP
  42. 42. HOW TO GET INTO TROUBLE: Faith: have faith the best practices you read in a blog really work. Assume: assume you understand best practices you read in a blog. Hope: hope you will somehow avoid scaling problems.
  43. 43. SOME TROPES: Look behind the door. Don’t run into the dark barn alone. Avoid the trackless forest after dark. Stay with your friends. Don’t dismiss the funny noises.
  44. 44. TO DO CHECKLIST: Simulate early and often. Define Python classes. Use JSON Schema to provide fine-grained definitions with ranges, formats, enums. Build a generator to populate instances in bulk. Gather performance data. Profit.