
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Marketing Cloud

Tal Sharon (Software Architect), Aviel Buskila (DevOps Engineer) and Max Peres (Data Engineer) @ Nielsen:
At the Nielsen Marketing Cloud, we used to manage our data pipelines via AWS Data Pipeline. Over the years, we’ve encountered several issues with this tool, and a year ago we decided to embark on a journey to replace it with a tool more suitable for our needs.
In this session, we’ll discuss how we actually migrated to Airflow, what challenges we faced and how we mitigated them (and even contributed to the open-source project along the way). We’ll also provide some helpful tips for Airflow users.



  1. Copyright © 2017 The Nielsen Company (US), LLC. Confidential and proprietary. Do not distribute. The Big Web Theory
  2. AIRFLOW
  3. Who are we?
     ● Tal Sharon - Software Architect
     ● Aviel Buskila - DevOps Tech Lead
     ● Max Peres - Data Engineer
  4. What will you learn today?
     ● Airflow and how it solved our problems
     ● How you can deploy Airflow to production
     ● Best practices for annoying data problems
  5. Nielsen Marketing Cloud
     ● eXelate - acquired by Nielsen in 2015
     ● Marketing data cloud service
     ● Creating targeting profiles
     ● VERY BIG DATA
     ● Machine learning
  6. Nielsen Marketing Cloud - main challenge
     How many unique users of a certain profile can we reach? E.g. a campaign for young women who love tech.
  7. What we had...
  8. Data pipeline UI
     Multiple workflow parameters are missing - duration, successes & failures, ...
  9. And this is how we had to configure it...
     Not all configurations are visible
     The workflow is hidden
  10. We wanted something else...
      ● Configuration and workflow visibility
      ● Monitoring and statistics
      ● Share common configuration/code between our workflows
      ● Ability to have only 1 concurrent execution of a workflow
  11. The Alternatives
      Tool                Community        Main Purpose           Flow Definition  UI       Auto Scheduling  Smart Scheduling
      Airflow (Apache)    Very Active      General Purpose        Python           Rich     V                V
      Luigi (Spotify)     Active           General Purpose        Python           Limited  X                X
      Oozie (Apache)      Active           Hadoop Job Scheduling  XML              Limited  V                X
      Azkaban (LinkedIn)  Not very active  Hadoop Job Scheduling  Custom DSL       Rich     V                X
      (V = yes, X = no)
  12. What is Airflow?
      ‘A platform to programmatically author, schedule, and monitor workflows’
      Each workflow is described by a DAG (Directed Acyclic Graph), which is built from:
      ● Operators: determine what actually gets done
      ● Sensors: monitor the job and report success/failure
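
To make the DAG idea concrete, here is a minimal sketch in plain Python (not Airflow code) that models a workflow as a dependency graph and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The task names are borrowed from the sub-DAG slide later in the deck; the linear ordering between them, the dictionary layout, and the `execution_order` helper are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on (upstream).
# Task names come from the sub-DAG slide; the dependencies are assumed.
EMR_PIPELINE = {
    "Create_EMR_Cluster": set(),
    "Create_EMR_Step": {"Create_EMR_Cluster"},
    "Step_Sensor": {"Create_EMR_Step"},
}


def execution_order(dependencies):
    """Return one valid run order; raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(execution_order(EMR_PIPELINE))
```

The "acyclic" part matters: a graph with a cycle has no valid run order, which is why the helper rejects it instead of returning a partial result.
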
  13. Airflow overview
      ● Webserver: accepts HTTP requests and allows the user to interact with it. It provides the ability to act on the DAG status (pause, unpause, trigger).
      ● Scheduler: monitors the DAGs and periodically inspects tasks to see if they can be triggered.
      ● Worker: daemons that actually execute the logic of tasks.
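
The scheduler's core question, "can this task be triggered?", can be illustrated with a toy function. This is a simplified sketch in plain Python, not Airflow's scheduler logic: it assumes a task is runnable once it has no recorded state and every upstream task has succeeded. The names `runnable_tasks`, `upstream`, and `states` are hypothetical.

```python
def runnable_tasks(upstream, states):
    """Return, sorted by name, tasks with no state yet whose upstream tasks all succeeded.

    upstream: dict mapping task name -> set of upstream task names
    states:   dict mapping task name -> state string such as "success" or
              "failed"; a task that has not started is simply absent
    """
    ready = []
    for task, parents in upstream.items():
        if task in states:
            continue  # already running or finished
        if all(states.get(parent) == "success" for parent in parents):
            ready.append(task)
    return sorted(ready)


if __name__ == "__main__":
    upstream = {
        "Create_EMR_Cluster": set(),
        "Create_EMR_Step": {"Create_EMR_Cluster"},
        "Step_Sensor": {"Create_EMR_Step"},
    }
    print(runnable_tasks(upstream, {}))
    print(runnable_tasks(upstream, {"Create_EMR_Cluster": "success"}))
```

Calling the function repeatedly as task states change mimics the periodic inspection described on the slide.
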
  14. Airflow UI
  15. Taking it to Production
  16. Challenges
      1. Code dependency management
      2. Automated deployment from dev to prod
      3. Long running tasks infrastructure
      4. Scaling Airflow workers
      5. Deploy to Kubernetes
  17. Code Dependency management - DAG Structure
      Diagram labels: DAG Dependency, DAG Dependency, DAG Helpers, DAG File
  18. Code Dependency management - DAG Build
  19. Automated deployment from dev to prod
      Package artifact with a version and push
      Deploy a DAG file with a specific version
  20. Function as a service (FaaS)
      FaaS is a category of cloud computing services that provides a platform allowing customers to develop, run, and manage application functionalities without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app.
      Source: https://en.wikipedia.org/wiki/Function_as_a_service
  21. OpenFaaS - Long running tasks infrastructure
      OpenFaaS (Functions as a Service) is a framework for building Serverless functions with Docker and Kubernetes which has first-class support for metrics. Any process can be packaged as a function enabling you to consume a range of web events without repetitive boiler-plate coding.
  22. Airflow and OpenFaaS
  23. Open source donations
  24. Scaling Airflow workers
      Airflow has 3 types of executors:
      1. Sequential Executor - one task & one worker
      2. Local Executor - parallel tasks & one worker
      3. Celery Executor - parallel tasks & multiple workers
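
The difference between "one task at a time" and "parallel tasks on one worker" can be shown with the standard library alone. This sketch is only an analogy, not Airflow's executor code: `run_sequential` mirrors the Sequential Executor, and `run_parallel` uses a local thread pool to mimic several tasks running side by side on one machine (threads are used here purely to keep the example short). The Celery Executor additionally distributes tasks to multiple worker machines through a message queue, which is not modelled here. Both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(tasks):
    """Run zero-argument callables one at a time, in order."""
    return [task() for task in tasks]


def run_parallel(tasks, max_workers=4):
    """Run zero-argument callables concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Five independent toy tasks; n=n binds the loop value at definition time.
    tasks = [lambda n=n: n * n for n in range(5)]
    print(run_sequential(tasks))
    print(run_parallel(tasks))
```

Both functions return the same results; only the degree of concurrency differs, which is the trade-off the three executors represent.
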
  25. Celery executor setup
  26. Deployment to Kubernetes
  27. So what did we achieve?
      ✓ Dependency management
      ✓ Automated deployment from dev to prod
      ✓ Long running tasks infrastructure
      ✓ Scalable Airflow workers
      ✓ Deploy to Kubernetes
  28. Annoying Problems
  29. Annoying Data Maintenance Problems
      ✓ Troubleshooting broken pipelines
      ✓ Rerunning parts of a pipeline
  30. Classical Cron Scheduling
  31. Troubleshooting broken pipelines
  32. Understanding a Pipeline (DAG)
  33. Sub Pipeline (sub-DAG)
      Diagram labels: Create_EMR_Cluster, Create_EMR_Step, Step_Sensor
  34. Rerunning a Task
      Diagram labels: Create_EMR_Cluster, Create_EMR_Step, Step_Sensor, Failed Task
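
Rerunning from a failed task usually means rerunning that task plus everything downstream of it, while leaving upstream results alone. Below is a minimal sketch of that selection in plain Python. The `tasks_to_rerun` helper and the `downstream` mapping are hypothetical and not part of Airflow, and the linear ordering of the three tasks is an assumption; the slide does not say which task failed.

```python
from collections import deque


def tasks_to_rerun(downstream, failed_task):
    """Return the failed task followed by every task reachable downstream of it."""
    to_rerun = [failed_task]
    seen = {failed_task}
    queue = deque([failed_task])
    while queue:  # breadth-first walk over downstream edges
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                to_rerun.append(child)
                queue.append(child)
    return to_rerun


if __name__ == "__main__":
    downstream = {
        "Create_EMR_Cluster": ["Create_EMR_Step"],
        "Create_EMR_Step": ["Step_Sensor"],
        "Step_Sensor": [],
    }
    print(tasks_to_rerun(downstream, "Create_EMR_Step"))
```

In Airflow itself, the comparable action is clearing a task instance together with its downstream tasks so that the scheduler picks them up again.
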
  35. Cool Features
  36. Scaling Out
  37. Scaling Out
  38. Message Exchange (XCom)
      Allows sharing data between tasks (example: configs)
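
The idea behind XCom is a small key-value exchange scoped to the task that published the value. The following is a toy in-memory model, not Airflow's implementation (Airflow persists XComs in its metadata database). The `ToyXCom` class, its method signatures, and the example key and value are hypothetical, although the push/pull vocabulary follows Airflow's `xcom_push` and `xcom_pull`.

```python
class ToyXCom:
    """In-memory stand-in for a cross-task message store."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, key, value):
        """Record a value published by task_id under key."""
        self._store[(task_id, key)] = value

    def pull(self, task_id, key, default=None):
        """Fetch the value that task_id published under key, or default."""
        return self._store.get((task_id, key), default)


if __name__ == "__main__":
    xcom = ToyXCom()
    # An upstream task publishes a small piece of configuration ...
    xcom.push("Create_EMR_Cluster", "cluster_id", "example-cluster")
    # ... and a downstream task reads it by naming the publishing task.
    print(xcom.pull("Create_EMR_Cluster", "cluster_id"))
```

This kind of exchange suits small values such as identifiers or config entries; large datasets are better passed by reference, for example as a storage path.
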
  39. Sample DAG
  40. Sample DAG
  42. Summary
      ● Maintenance becomes a BREEZE...
      ● Recovery is a NO-BRAINER!
      ● Lots of cool & convenient features
  43. AIRFLOW
      ● Amazing tool for data pipelines
      ● Open source community
      ● Cool UI & API
      Make your life EASIER!!!
  44. Questions?
  45. Come and join us! https://www.comeet.co/jobs/nielsen/33.000
  46. Thank you!
  47. BACKUP SLIDES
  48. {
        "Data-In-Dpu_us": {
          "bucket": "xl8emr-assets",
          "file": "data-in/config/dpu_airflow.json"
        },
        "Data-In-Dpu_eu": {
          "bucket": "xl8emr-assets-eu",
          "file": "data-in/config/dpu_airflow.json"
        },
        "Data-In-Dpu_ap": {
          "bucket": "xl8emr-assets-eu",
          "file": "data-in/config/dpu_airflow.json"
        }
      }
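
One way such a per-region configuration might be consumed is to parse it and turn each entry into an S3 location. This is a sketch under assumptions: the slide does not show how the JSON is used, the `s3://bucket/key` form is simply the conventional S3 URI layout, and the `config_locations` helper is hypothetical. The JSON values are copied from the slide unchanged.

```python
import json

# Configuration as shown on the backup slide (values unchanged).
REGION_CONFIG = """
{
  "Data-In-Dpu_us": {"bucket": "xl8emr-assets", "file": "data-in/config/dpu_airflow.json"},
  "Data-In-Dpu_eu": {"bucket": "xl8emr-assets-eu", "file": "data-in/config/dpu_airflow.json"},
  "Data-In-Dpu_ap": {"bucket": "xl8emr-assets-eu", "file": "data-in/config/dpu_airflow.json"}
}
"""


def config_locations(raw_json):
    """Map each entry name to the s3:// URI built from its bucket and file."""
    entries = json.loads(raw_json)
    return {
        name: f"s3://{entry['bucket']}/{entry['file']}"
        for name, entry in entries.items()
    }


if __name__ == "__main__":
    for name, uri in config_locations(REGION_CONFIG).items():
        print(name, "->", uri)
```

Keeping bucket and file as separate fields, as the slide does, lets the same file path be reused across regions while only the bucket changes.
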
