Gobblin @ NerdWallet (Nov 2015)
1. Gobblin @ NerdWallet
By Akshay Nanavati and Eric Ogren
akshay@nerdwallet.com eric@nerdwallet.com
2. Agenda
● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests
3. What Is NerdWallet?
● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.
5. Data Types @ NerdWallet
● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes)
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
6. Gobblin @ NW Today
● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
○ ~10 Kafka topics - 1 per event & schema type
○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
○ Events are already serialized as protobuf in each Kafka topic
○ Around 100 events/second
● Log Ingestion (Operational Data):
○ Extracts data from AWS logs sitting in S3
○ Parses log lines and serializes them to protobuf
○ Writes the serialized protobuf files back to S3 and eventually into Redshift
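The parsing step in the log-ingestion job can be sketched as follows. This is a minimal illustration, not NerdWallet's actual code: it assumes the default Nginx "combined" log format, and the field names are hypothetical; the real pipeline serializes the parsed record to protobuf rather than a dict.

```python
import re

# Hypothetical pattern for the default Nginx "combined" log format; the
# actual fields extracted in the real pipeline are not specified here.
NGINX_COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Parse one combined-format log line into a dict, or None on mismatch."""
    m = NGINX_COMBINED.match(line)
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    record["body_bytes"] = int(record["body_bytes"])
    return record
```

Lines that fail to parse return None, so the job can count and skip malformed input instead of aborting.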
8. Learnings: Deploying Gobblin w/Internal Code
● Have a repo of internal Gobblin modules (this is where we compile everything)
● Modified the build script to link the Gobblin project to our gobblin-modules
project
● Use Jenkins to compile Gobblin on the remote machine
● Maintain a separate repository with .pull files that we can sync with our stage
and production environments
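One of the synced .pull files might look roughly like the sketch below. The broker, topic pattern, and bucket are made-up placeholders, and the exact property names depend on the Gobblin release in use:

```properties
# Hypothetical hourly tracking-data job; property names may differ
# across Gobblin versions.
job.name=TrackingDataHourly
job.group=tracking

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
kafka.brokers=kafka01.example.com:9092
topic.whitelist=nw_tracking_.*

# SimpleDataWriter passes the protobuf bytes through untouched
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
data.publisher.type=gobblin.publisher.BaseDataPublisher
writer.fs.uri=s3://example-bucket/tracking/
```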
9. Current Contributions
● Simple Data Writer
○ class gobblin.writer.SimpleDataWriter
○ Writes each binary record as raw bytes with no regard to encoding
○ Optionally prefixes each record with its size, or appends a delimiter character (e.g. \n
for string data)
● Kafka Simple Extractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
○ Extracts binary data from Kafka as an array of bytes without any serde
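The size-prefixed framing that SimpleDataWriter supports can be sketched in a few lines. This is an illustration only: the 4-byte big-endian prefix is an assumption, not necessarily the width or endianness the actual writer uses.

```python
import io
import struct

def write_records(buf, records):
    """Frame each byte record with a 4-byte big-endian length prefix
    (prefix width and endianness are illustrative assumptions)."""
    for rec in records:
        buf.write(struct.pack(">i", len(rec)))
        buf.write(rec)

def read_records(buf):
    """Inverse of write_records: yield records until the stream is exhausted."""
    while True:
        header = buf.read(4)
        if not header:
            return
        (size,) = struct.unpack(">i", header)
        yield buf.read(size)
```

Because the prefix carries the length, records may contain arbitrary bytes, including the newline that would otherwise serve as a delimiter.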
10. Future Contributions
● Gobblin Dashboards
● S3 Source & Extractor
○ Given an S3 bucket, extract all files matching a regex
■ Leverages FileBasedExtractor
■ We would also like to modify this to have similar functionality to
DatePartitionedDailyAvroSource
● S3 Publisher
○ Publishes files to S3
○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since
we are running in standalone mode this is not an issue for us
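The key-selection logic of the proposed S3 source is simple enough to sketch. This only shows the regex filter over an already-fetched key list; a real source would page through the bucket listing with the AWS SDK, and the key names below are hypothetical.

```python
import re

def select_keys(keys, pattern):
    """Return the S3 object keys matching a regex — the core of the
    proposed S3 source; listing the bucket itself is left to the AWS SDK."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.search(k)]
```

Each matching key would then become a work unit handed to a FileBasedExtractor.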
12. Gobblin @ NW Tomorrow
● More data types
○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
○ Offer data from our site: MySQL => S3 (batch and incremental)
○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
○ Salesforce Data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks
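The "batch and incremental" distinction above hinges on a high watermark: a batch pull takes every row, while an incremental pull takes only rows modified since the last run. A minimal sketch, assuming a hypothetical `updated_at` column (Gobblin itself persists watermarks in job state):

```python
def incremental_pull(rows, last_watermark):
    """Select rows modified after the previous run's high watermark and
    return them along with the new watermark. The 'updated_at' column
    name is an illustrative assumption."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark
```

Re-running with the returned watermark picks up nothing until new writes arrive, which is what makes hourly scheduling cheap.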
13. Early Adoption Pain Points & Solutions
● Best practices around ingestion w/ transformation steps
● Initial problems integrating NW specific code (especially extractors &
converters) into Gobblin’s build process
● Best practices around scheduler integration - Quartz (built-in) vs ETL
schedulers
● Backwards-incompatible changes forced us to run migrations when upgrading
versions
● No changelogs & tagged releases
14. Things we would like to see/add in future
● Abstract out Avro specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y,Z)
● Release notes & tags :)
● The build & unit test process is very bloated
○ Hard to differentiate warnings/stack traces vs legitimate build issues
○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines
(port collisions)