2. About Me
• Senior Principal Engineer, Office of Technology, Red Hat
• Committer and PMC member on Apache Mahout
• Contributor to DeepLearning4J and Oryx 2.0
• Co-Organizer of Washington DC Apache Flink Meetup
• Founder of Boston Apache Flink Meetup
3. Outline Of Talk
• What is BigPetStore?
• Why BigPetStore?
• Synthetic Data
• BigPetStore - MapReduce, Spark
• BigPetStore - Flink
• Future possibilities
4. What is BigPetStore?
• Blueprints for Big Data
applications
• Consists of:
– Data Generators
– Examples using tools in
Big Data ecosystem to
process data
– Build system and tests for
integrating tools and
multiple JVM languages
• Part of Apache Bigtop
• Used for:
– Templates for infrastructure
(build, integration, testing)
– Educational examples
– Testing
– Demos
– Benchmarking
5. Why BigPetStore? (1)
As a developer, I want an application blueprint that…
• scales to a size approximating my data-domain
• includes idiomatic unit and integration testing
• demonstrates ETL as well as analytics
In other words…
Word count was great for MapReduce, but we need
something more to demonstrate the advanced capabilities
of newer processing engines
6. Why BigPetStore? (2)
PetStores have been around for a while to showcase
different technologies, starting with Sun’s Web Petstore in
the early days of J2EE.
Everyone knows what a PetStore is, so it is intuitive even to
non-developers.
8. Vision
• Bigtop Data Generators - a resource for all Apache
projects!
• To build more sophisticated blueprints for users and
developers
• Useful for smoke testing infrastructure and applications!
9. Case for Synthetic Data
• Most company data is private and confidential
• Licensing concerns with sharing the data
• Secure data cannot be moved out of production
• Enable more realistic example applications
• Enable more comprehensive testing than regular
wordcount or TeraSort
10. Bigtop Data Generators
• BigPetStore Data Generator
• Bigtop Weatherman
• Bigtop Bazaar
• Locations Library
• Sampler Library
• Name Generator
• Product Generator
11. BigPetStore-MapReduce (BIGTOP-1270)
• Originally, a MapReduce
application for demonstrating
MapReduce, Pig, Mahout.
• Primitive “hierarchical” data
generator for generating fake
petstore transactions (at any scale).
• Part of ASF Bigtop; used at Red
Hat and other companies for
testing the Hadoop ecosystem.
12. New Data Generator for BigPetStore
• Motivation: realistic ML/analytics examples
• Goal: More complex patterns embedded in data
• Mathematical modeling and simulation
– Sampling from PDFs
– (Hidden) Markov Models
– Poisson processes
– Stochastic differential equations
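As an illustration of one technique listed above, here is a minimal sketch of sampling event times from a homogeneous Poisson process by summing exponential inter-arrival times. The function name, the rate, and the 30-day horizon are assumptions made for this example; they are not taken from the BigPetStore generator.

```python
import random


def poisson_arrival_times(rate, horizon, rng):
    """Return sorted event times of a homogeneous Poisson process on [0, horizon).

    Inter-arrival gaps of a Poisson process with intensity `rate` are
    independent exponential random variables with mean 1 / rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Hypothetical usage: a customer who buys on average once every two days,
# simulated over a 30-day window.
rng = random.Random(42)
purchase_days = poisson_arrival_times(rate=0.5, horizon=30.0, rng=rng)
```

The expected number of events is rate × horizon, so this customer makes about 15 purchases on average.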
13. Next Step: A Platform-Independent Data Generator
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth
International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
14. BigPetStore Data Model
• Generative Model leveraging well-known mathematical
modeling techniques to simulate factors influencing
customers’ purchasing habits.
• In several cases, real data is used to parameterize the model
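To make the idea of a generative model concrete, here is a minimal sketch of a Markov chain over product categories, where each purchase depends only on the previous one. The categories and transition probabilities are invented for illustration; they are not the parameters of the published BigPetStore model.

```python
import random

# Hypothetical transition probabilities: the outer key is the category just
# bought, the inner dict gives the probability of each next category.
TRANSITIONS = {
    "dog food": {"dog food": 0.6, "dog treats": 0.3, "cat litter": 0.1},
    "dog treats": {"dog food": 0.7, "dog treats": 0.2, "cat litter": 0.1},
    "cat litter": {"dog food": 0.2, "dog treats": 0.1, "cat litter": 0.7},
}


def simulate_purchases(start, n, rng):
    """Return a sequence of n purchased categories beginning with `start`."""
    state = start
    sequence = [state]
    for _ in range(n - 1):
        options = TRANSITIONS[state]
        # Draw the next category according to the current row's weights.
        state = rng.choices(list(options), weights=list(options.values()))[0]
        sequence.append(state)
    return sequence


history = simulate_purchases("dog food", 10, random.Random(1))
```

In a real model the rows of such a table would be estimated from data rather than written by hand.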
16. BigPetStore-TransactionQueue
• No need for API calls: just use Docker
• Generate load for any app, not just JVM apps
• docker run -t -i smarthi/bigpetstore-transaction-queue
17. BigPetStore-Spark (BIGTOP-1535)
• RJ Nowling rewrote the BigPetStore
data generator components to generate
more complex data sets, with patterns
varying in many dimensions.
• BigPetStore-Spark was then added to
ASF Bigtop, demonstrating that the
data generator could be used in a
distributed context.
18. BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928)
• A Flink application blueprint.
• Generates data at any scale.
• Uses Flink streams to write generated data to disk.
• Uses Flink DataStream transformations to transform data
sets for analytics.
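The generate-then-transform flow described above can be sketched in plain Python. This is not the Flink API and not the BigPetStore-Flink code: the record layout (store id, product, price), the product list, and the function names are all assumptions chosen for the example. A generator stands in for the stream, and a keyed sum stands in for a grouped aggregation.

```python
import random
from collections import defaultdict


def generate_transactions(n, rng):
    """Lazily yield n hypothetical (store_id, product, price) records."""
    products = [("dog food", 20.0), ("cat litter", 12.5), ("fish flakes", 4.0)]
    for _ in range(n):
        product, price = rng.choice(products)
        yield (rng.randrange(3), product, price)


def revenue_per_store(transactions):
    """Group records by store id and sum their prices."""
    totals = defaultdict(float)
    for store_id, _product, price in transactions:
        totals[store_id] += price
    return dict(totals)


# Records are consumed one at a time, so the full data set is never
# held in memory.
totals = revenue_per_store(generate_transactions(1000, random.Random(7)))
```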
20. Future Endeavors
• How to help users build their own models?
• How to use the Bigtop Data Generators for load testing?
• How to produce synthetic copies from real datasets?
• Better libraries and abstractions to reduce boilerplate
• Research: Investigating Probabilistic Programming
Languages which provide advanced sampling and
inference algorithms combined with high-level DSLs for
model specifications
21. Future: BigPetStore - Flink
A BigPetStore Blueprint for:
• Flink Batch
• Flink Table API
• Flink ML algorithms
22. Resources
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet
Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud
Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
https://github.com/apache/bigtop/tree/master/bigtop-data-generators
https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore
Bigtop Data Generators available as a library:
http://dl.bintray.com/rnowling/bigpetstore
23. TL;DR
• Bigtop Data Generators - a resource for all Apache Big Data projects
• Comprehensive Blueprints
• Smoke and integration testing
• Load testing
• BigPetStore-Flink soon to be part of Apache Bigtop (BIGTOP-1927 &
BIGTOP-1928)
• Future Endeavors
• Expand BigPetStore Flink as new Flink features become available
• Make models easier to build
• Easier ways to generate synthetic data from models built on real data