
Building a Data Ingestion & Processing Pipeline with Spark & Airflow


Why we built a Data Ingestion & Processing Pipeline with Spark & Airflow at Datlinq, and all the parts needed to bring it together in our big data system.

Presented 2017-02-10 at Data Driven Rijnmond Meetup:

Published in: Engineering


  1. Building a Data Ingestion & Processing Pipeline with Spark & Airflow. Data Driven Rijnmond. Tom Lous (@tomlous)
  2. Datlinq by Numbers: Who Are We?
     • 14 countries: Netherlands, Belgium, France, UK, Germany, Spain, Italy, Luxembourg
     • 100 customers across Europe: Coca-Cola, Unilever, FrieslandCampina, Red Bull, Spa, AB InBev & Heineken
     • 2000 active users: working with our SaaS, insights & mobile solutions
     • 1.2M outlets: restaurants, bars, hotels, bakeries, supermarkets, kiosks, hospitals
     • 5000 mutations per day: out of business, starters, ownership change, name change, etc.
     • 30 students: manually checking many changes
  3. Datlinq Collects Data: What Do We Do?
     • Outlet features: address, segmentation, opening hours, contact info, lat/long, terrace, parking, pricing
     • Outlet surroundings: nearby public transportation, schools, going-out areas, shopping, high traffic
     • Demographics: median income, age, density, marital status
     • Social media: presence, traffic, likes, posts
  4. Datlinq's Mission: Why Do We Do It?
     • Make the food service market transparent
     • Effective marketing & sales
     • Which food products are relevant where?
  5. Solution: Why Spark?
     • Scaling is hard: mo' outlets, mo' students
     • Data != information: one-off analysis projects should be done continuously and immediately
     • Work smarter, not harder: have students train AI instead of doing repetitive work
     • Faster results: distributed computing allows for increased processing speeds
  6. Solution: How to Start
     • Focus on the MVP: a Minimum Viable Product that acts like a Proof of Concept
     • Declare the riskiest assumption: we can automatically match data from different sources
     • Prove it: small-scale ML models
  8. Building the Pipeline: Use the Cloud
     • Flexible usage: pay for what you use
     • Scaling out of the box: the only limit seems to be your credit card
     • CLI > UI: CLI tools are your friend; the UI is limited
  9. Building the Pipeline: Develop Locally
     • Check versions: your cloud provider sets the max version (e.g. Spark 2.0.2 instead of 2.1.0)
     • Same code, smaller data: use samples of your data to test on
     • Use at least 2 threads: otherwise parallel execution cannot be truly tested
     • Exclude libraries from deployment: some libraries are provided on the cloud, so don't add them to your final build
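The two-thread advice can be pinned in a local Spark config so parallel execution is always exercised during development; a minimal sketch of a spark-defaults.conf, with illustrative memory values:

```
# conf/spark-defaults.conf (local development only; values are illustrative)
# local[2] runs two executor threads, so parallelism bugs surface locally
spark.master          local[2]
spark.driver.memory   2g
```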
  10. Building the Pipeline: Continuous Integration
     • Deploy code: manual uploads or deployments are asking for trouble
     • Use polling instead of pushing, for security
     • Create a pipeline script to stage builds
     • Automated tests
     • Commit config: store the CI configuration in version control
  11. Building the Pipeline: Spark Jobs
     • All jobs in one project: unless there is a very good reason, keep all your jobs in one file
     • Use Datasets as much as possible: avoid RDDs and DataFrames
     • Use case classes (encoders) as much as possible: the generic Row object will only get you so far
     • Verbosity only in the main class: log messages from nodes won't make it to the master
     • Look at the execution plan: remove temp files, clean the slate
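The Dataset-with-case-class advice is Scala Spark; as a language-neutral sketch of the underlying idea (typed records fail fast at the boundary, generic rows fail late at use time), using a hypothetical Outlet record rather than Datlinq's real schema:

```python
from dataclasses import dataclass

# Hypothetical outlet record; fields are illustrative, not Datlinq's schema.
@dataclass(frozen=True)
class Outlet:
    outlet_id: str
    name: str
    city: str

def to_outlet(row: dict) -> Outlet:
    # Converting a generic row into a typed record fails immediately on a
    # missing field, much like encoding a Row into a Dataset[Outlet] in Spark.
    return Outlet(outlet_id=row["outlet_id"], name=row["name"], city=row["city"])

rows = [{"outlet_id": "42", "name": "Cafe De Hoek", "city": "Rotterdam"}]
outlets = [to_outlet(r) for r in rows]
```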
  12. Building the Pipeline: Parquet
     • What is it: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
     • Compressed: small footprint
     • Columnar: easier to drill down into data based on fields
     • Metadata: stores meta information that is easy for Spark to use (counts, etc.)
     • Partitioning: allows partitioning on values for faster access
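Partitioning on values maps directly to a key=value directory layout on disk; a small sketch of the Hive-style path convention that Spark's partitionBy writes for Parquet output (column names are illustrative):

```python
def partition_path(base: str, record: dict, partition_cols: list) -> str:
    # Hive-style key=value directories, e.g. base/country=NL/year=2017,
    # which let readers skip whole directories when filtering on those columns
    parts = [f"{col}={record[col]}" for col in partition_cols]
    return "/".join([base] + parts)

path = partition_path("outlets.parquet", {"country": "NL", "year": 2017}, ["country", "year"])
# path == "outlets.parquet/country=NL/year=2017"
```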
  13. Building the Pipeline: Elasticsearch
     • NxN matching is computationally expensive: querying Elasticsearch / Lucene / Solr is far cheaper, with fuzzy searching using n-grams and other tokenizers
     • Cluster out of the box: auto replication
     • Config differs from normal use: it's not a search feature for a website
       Don't care about update performance
       Don't load-balance queries (_local)
     • Discovery plugin for Google Cloud: so nodes can find each other on the local network without Zen; create VMs with --scopes=compute-rw
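The fuzzy-matching idea can be sketched as an Elasticsearch query body built in Python; the field names and result size are assumptions, and the n-gram tokenization itself lives in the index mapping, not in the query:

```python
def outlet_match_query(name: str, city: str) -> dict:
    # Look up likely duplicates with a fuzzy match on an analyzed "name"
    # field, filtered to the same city, instead of comparing all NxN pairs.
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"name": {"query": name, "fuzziness": "AUTO"}}}
                ],
                "filter": [
                    {"term": {"city": city}}
                ],
            }
        },
        "size": 5,
    }

body = outlet_match_query("Cafe De Hoek", "rotterdam")
```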
  14. Building the Pipeline: Ansible
     • Automate VMs, agentless: the Elasticsearch cluster
     • Dependent on the Ansible GC modules: gcloud keeps updating; Ansible GC doesn't directly follow suit
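An agentless provisioning task might look like the sketch below, using the gce module that shipped with Ansible at the time (since deprecated); every name and value here is an illustrative assumption, not Datlinq's actual config:

```yaml
# Illustrative only: provision one Elasticsearch node on Google Compute Engine
- hosts: localhost
  connection: local
  tasks:
    - name: Provision Elasticsearch node
      gce:
        instance_names: es-node-1
        machine_type: n1-standard-2
        image: debian-8
        zone: europe-west1-b
```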
  15. Building the Pipeline: Airflow
     • Create workflows: define all tasks and dependencies
     • Cron on steroids: starts every day
     • Dependencies: tasks are dependent on tasks upstream
     • Backfill / rerun: run as if in the past
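Real Airflow DAGs are Python modules of operators; as a dependency-free sketch of the two ideas above (upstream dependencies and daily backfill), assuming Python 3.9+ for graphlib and using illustrative task names:

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on, mirroring
# upstream/downstream wiring in an Airflow DAG (task names are illustrative).
dag = {
    "ingest": set(),
    "match_outlets": {"ingest"},
    "write_parquet": {"match_outlets"},
}

def run_order(graph: dict) -> list:
    # A task only runs after everything upstream of it has run
    return list(TopologicalSorter(graph).static_order())

def backfill_dates(start: date, end: date) -> list:
    # Backfill: run the DAG once per day, as if it were that day in the past
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

order = run_order(dag)  # ["ingest", "match_outlets", "write_parquet"]
dates = backfill_dates(date(2017, 2, 1), date(2017, 2, 3))  # 3 daily runs
```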
  17. Conclusion: Finally
     • What's next?
       Improving the MVP
       Adding more machine learning
       Adding more data sources
       Adding more countries
       Improving the architecture
     • Production
       Create an API for use in our products
       Create a UI for exploring the data
       Export to standard databases for further use
     • Team: hire Data Engineer(s), hire Data Scientist(s)
  18. Conclusion: Thanks for Watching
     • Questions? AMA
     • Vacancies! Ask me or check our site
     • Break: grab a drink and let's reconvene after a 15-minute break