Gobblin @ NerdWallet (Nov 2015)
By Akshay Nanavati and Eric Ogren
akshay@nerdwallet.com eric@nerdwallet.com
Agenda
● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests
What Is NerdWallet?
● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.
NerdWallet Tech Stack
● Front-End
● Services Tier
● Data Analytics
● Data Systems & Platforms
Data Types @ NerdWallet
● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
Gobblin @ NW Today
● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
○ ~10 Kafka topics - 1 per event & schema type
○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
○ Events are already serialized as protobuf in each Kafka topic
○ Around 100 events/second
● Log Ingestion (Operational Data):
○ Extracts data from AWS logs sitting in S3
○ Parses log lines and serializes them to protobuf
○ Writes the serialized protobuf files back to S3 and eventually into Redshift
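The date-partitioned layout used by the hourly tracking jobs can be sketched as below. This is only an illustration: the `tracking` base prefix, the `topic/YYYY/MM/DD/HH` layout, and the function name are assumptions, since the slides do not give the actual directory scheme.

```python
from datetime import datetime, timezone


def partition_prefix(topic: str, run_time: datetime, base: str = "tracking") -> str:
    """Build an hourly, date-partitioned key prefix for one Kafka topic.

    The layout (base/topic/YYYY/MM/DD/HH/) is a hypothetical example,
    not the layout NerdWallet actually uses.
    """
    # Normalize to UTC so every job run maps to an unambiguous partition.
    utc = run_time.astimezone(timezone.utc)
    return f"{base}/{topic}/{utc:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    run = datetime(2015, 11, 5, 13, 30, tzinfo=timezone.utc)
    print(partition_prefix("page_view", run))
```

Keeping one prefix per topic and hour means a rerun of a single hourly job only touches its own directory.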
Tracking Pipeline
Learnings: Deploying Gobblin w/Internal Code
● Have a repo of internal Gobblin modules (this is where we compile everything)
● Modified the build script to link the Gobblin project to our gobblin-modules project
● Use Jenkins to compile Gobblin on the remote machine
● Maintain a separate repository with .pull files that we can sync with our stage and production environments
Current Contributions
● Simple Data Writer
○ class gobblin.writer.SimpleDataWriter
○ Writes each binary record as raw bytes with no regard to encoding
○ Optionally prepends each record with its size or appends a char delimiter at the end of each record (e.g. \n for string data)
● Kafka Simple Extractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
○ Extracts binary data from Kafka as an array of bytes without any serde
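The two framing options described for the Simple Data Writer (a size prefix or a trailing delimiter) can be illustrated with a short sketch. This is not the `SimpleDataWriter` implementation: the 4-byte big-endian prefix and the function names are assumptions chosen for the example, as the slides do not specify the on-disk format.

```python
import struct
from typing import Iterable, Iterator, Optional


def frame_records(records: Iterable[bytes],
                  prepend_size: bool = False,
                  delimiter: Optional[bytes] = None) -> bytes:
    """Concatenate opaque byte records, optionally framing each one."""
    out = bytearray()
    for record in records:
        if prepend_size:
            # Assumed format: 4-byte big-endian unsigned length.
            out += struct.pack(">I", len(record))
        out += record
        if delimiter is not None:
            out += delimiter
    return bytes(out)


def read_size_prefixed(data: bytes) -> Iterator[bytes]:
    """Split a buffer written with prepend_size=True back into records."""
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield data[offset:offset + size]
        offset += size


if __name__ == "__main__":
    framed = frame_records([b"ab", b"\n\x00"], prepend_size=True)
    print(list(read_size_prefixed(framed)))
```

A size prefix suits binary payloads such as serialized protobuf, where a delimiter byte could also appear inside a record; a delimiter is enough for line-oriented string data.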
Future Contributions
● Gobblin Dashboards
● S3 Source & Extractor
○ Given an S3 bucket, extract all files matching a regex
■ Leverages FileBasedExtractor
■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
● S3 Publisher
○ Publishes files to S3
○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone mode this is not an issue for us
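The regex-based file selection planned for the S3 Source & Extractor could look roughly like the sketch below. It works on a plain list of object keys so it stays self-contained; how the keys are listed from the bucket, the key names, and the function name are all assumptions, not part of Gobblin or of the planned contribution.

```python
import re
from typing import Iterable, List


def select_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return the object keys that fully match the given regex, sorted."""
    regex = re.compile(pattern)
    # fullmatch avoids accidentally selecting keys that only contain the pattern.
    return sorted(key for key in keys if regex.fullmatch(key))


if __name__ == "__main__":
    # Hypothetical keys, standing in for a bucket listing.
    listing = [
        "logs/2015-11-05/elb-01.log",
        "logs/2015-11-05/readme.txt",
        "logs/2015-11-06/elb-02.log",
    ]
    print(select_keys(listing, r"logs/\d{4}-\d{2}-\d{2}/.*\.log"))
```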
Future: Dashboards
Gobblin @ NW tomorrow
● More data types
○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
○ Offer data from our site: MySQL => S3 (batch and incremental)
○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
○ Salesforce Data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks
Early Adoption Pain Points & Solutions
● Best practices for ingestion w/ transformation steps
● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
● Backwards-incompatible changes forced us to do migrations when upgrading versions
● No changelogs & tagged releases
Things we would like to see/add in future
● Abstract out Avro-specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y,Z)
● Release notes & tags :)
● The build & unit test process is very bloated
○ Hard to differentiate warnings/stack traces vs legitimate build issues
○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)
Thanks!
Questions??

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 

Recently uploaded (20)

Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 

Gobblin @ NerdWallet (Nov 2015)

  • 1. Gobblin @ NerdWallet By Akshay Nanavati and Eric Ogren akshay@nerdwallet.com eric@nerdwallet.com
  • 2. Agenda
      ● Introduction to NerdWallet
      ● Gobblin @ NerdWallet Today
      ● Initial Pain Points & Learnings
      ● Contributions (Present and Future)
      ● Future Use Cases & Requests
  • 3. What Is NerdWallet?
      ● Started in 2009. 275+ employees
      ● Highly profitable. Series A funding Feb 2015.
      ● We want to bring clarity to life’s financial decisions.
  • 4. NerdWallet Tech Stack (diagram: Front-End, Services Tier, Data Analytics, Data Systems & Platforms)
  • 5. Data Types @ NerdWallet
      ● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
          ○ Synced to Redshift periodically
      ● Consumer Identity Data (Postgres: medium reads, medium writes)
      ● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
      ● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
      ● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
      ● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
  • 6. Gobblin @ NW Today
      ● Running in standalone mode
      ● Ingests user tracking and operational log data
      ● Tracking Data:
          ○ ~10 Kafka topics - 1 per event & schema type
          ○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
          ○ Events are already serialized as protobuf in each Kafka topic
          ○ Around 100 events/second
      ● Log Ingestion (Operational Data):
          ○ Extracts data from AWS logs sitting in S3
          ○ Parses log lines and serializes them to protobuf
          ○ Writes the serialized protobuf files back to S3 and eventually into Redshift
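The hourly dump to a date-partitioned S3 directory on slide 6 can be illustrated with a small sketch. The slides do not show the actual directory layout, so the `topic/YYYY/MM/DD/HH` scheme, the function name `partition_path`, and the topic name `page_view` are all assumptions made for illustration, not NerdWallet's or Gobblin's real convention.

```python
from datetime import datetime, timedelta, timezone


def partition_path(topic: str, event_time: datetime) -> str:
    """Return a date-partitioned directory for the hour containing event_time.

    Hypothetical layout: <topic>/<year>/<month>/<day>/<hour>, always in UTC,
    so that every hourly job writes into exactly one directory.
    """
    utc = event_time.astimezone(timezone.utc)
    return f"{topic}/{utc:%Y/%m/%d/%H}"


# An event timestamped in UTC lands in the matching hourly directory.
utc_event = datetime(2015, 11, 5, 14, 30, tzinfo=timezone.utc)
print(partition_path("page_view", utc_event))

# An event timestamped in another zone is normalized to UTC first.
pacific_event = datetime(2015, 11, 5, 23, 30, tzinfo=timezone(timedelta(hours=-8)))
print(partition_path("page_view", pacific_event))
```

Normalizing to UTC before formatting keeps the partition for a given event stable regardless of where the producing host runs.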
  • 8. Learnings: Deploying Gobblin w/Internal Code
      ● Have a repo of internal Gobblin modules (this is where we compile everything)
      ● Modified the build script to link the gobblin project to our gobblin-modules project
      ● Use Jenkins to compile Gobblin on the remote machine
      ● Maintain a separate repository with .pull files that we can sync with our stage and production environments
  • 9. Current Contributions
      ● Simple Data Writer
          ○ class gobblin.writer.SimpleDataWriter
          ○ Writes binary records as bytes with no regard to encoding
          ○ Optionally prepends each record with its size or appends a char delimiter at the end of each record (e.g. \n for string data)
      ● Kafka Simple Extractor
          ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
          ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
          ○ Extracts binary data from Kafka as an array of bytes without any serde
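The two framing options described for the Simple Data Writer on slide 9 (a size prefix or a trailing delimiter) can be sketched as follows. This is not Gobblin's code: the 4-byte big-endian prefix and the names `frame_records` and `read_size_prefixed` are assumptions chosen for the illustration, and the real writer may use a different prefix width.

```python
import struct
from typing import Iterable, Iterator, Optional


def frame_records(
    records: Iterable[bytes],
    prepend_size: bool = False,
    delimiter: Optional[bytes] = None,
) -> bytes:
    """Concatenate raw records, optionally size-prefixed and/or delimited."""
    out = bytearray()
    for record in records:
        if prepend_size:
            # Assumed format: unsigned 32-bit big-endian length before each record.
            out += struct.pack(">I", len(record))
        out += record
        if delimiter is not None:
            out += delimiter
    return bytes(out)


def read_size_prefixed(data: bytes) -> Iterator[bytes]:
    """Split a buffer written with prepend_size=True back into records."""
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield data[offset:offset + size]
        offset += size


# Binary records (e.g. serialized protobuf) may contain any byte, so a size
# prefix is the safe choice; a delimiter only works when it cannot occur in a record.
binary_records = [b"ab", b"", b"\n\x00xyz"]
framed = frame_records(binary_records, prepend_size=True)
assert list(read_size_prefixed(framed)) == binary_records

text_lines = frame_records([b"first", b"second"], delimiter=b"\n")
```

The size prefix costs a few bytes per record but lets a reader recover record boundaries without inspecting the payload, which matters when the payload is opaque protobuf.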
  • 10. Future Contributions
      ● Gobblin Dashboards
      ● S3 Source & Extractor
          ○ Given an S3 bucket, extract all files matching a regex
              ■ Leverages FileBasedExtractor
              ■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
      ● S3 Publisher
          ○ Publishes files to S3
          ○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone mode this is not an issue for us
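The regex-based file selection planned for the S3 source on slide 10 amounts to filtering a bucket listing by key. The sketch below works on a plain list of key strings rather than calling S3, and the function name `select_keys`, the key names, and the pattern are hypothetical examples rather than anything taken from the slides.

```python
import re
from typing import Iterable, List


def select_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return the object keys that match the whole regex, in listing order."""
    compiled = re.compile(pattern)
    # fullmatch avoids accidentally accepting keys with extra suffixes such as ".tmp".
    return [key for key in keys if compiled.fullmatch(key)]


# Hypothetical bucket listing.
listing = [
    "logs/elb/2015/11/05/part-0001.log",
    "logs/elb/2015/11/05/part-0002.log.tmp",
    "logs/cloudtrail/2015/11/05/events.json",
]
matches = select_keys(listing, r"logs/elb/\d{4}/\d{2}/\d{2}/.*\.log")
print(matches)
```

Anchoring the match to the full key is a deliberate choice here: partially written or temporary files often share a prefix with finished ones.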
  • 12. Gobblin @ NW Tomorrow
      ● More data types
          ○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
          ○ Offer data from our site: MySQL => S3 (batch and incremental)
          ○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
          ○ Salesforce Data
      ● Integration with Airflow DAGs
      ● Integration with data cleansing & entity matching frameworks
  • 13. Early Adoption Pain Points & Solutions
      ● Best practices around ingestion w/ transformation steps
      ● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
      ● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
      ● Backwards-incompatible changes forced us to do migrations when upgrading versions
      ● No changelogs & tagged releases
  • 14. Things we would like to see/add in future
      ● Abstract out Avro-specific code
      ● Best practices for scheduler integration (can contribute for Airflow)
      ● Clustering without requiring Hadoop & YARN
      ● Metadata support (job X produced files Y, Z)
      ● Release notes & tags :)
      ● The build & unit test process is very bloated
          ○ Hard to differentiate warnings/stack traces vs legitimate build issues
          ○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)