SlideShare a Scribd company logo
1 of 18
Download to read offline
B U I L D I N G A D ATA I N G E S T I O N &
P R O C E S S I N G P I P E L I N E W I T H
S PA R K & A I R F L O W
D A TA D R I V E N R I J N M O N D
Tom Lous
@tomloushttps://cdn.mg.co.za/crop/content/images/2013/07/18/Coffee_Machine_-_Truth_-_Cape_Town_DH_4271_i2e.jpg/1680x1080/
D AT L I N Q B Y N U M B E R S
W H O A R E W E ?
• 14 countries

Netherlands, Belgium, France, UK, Germany, Spain, Italy, Luxembourg
• 100 customers across Europe

Coca-Cola, Unilever, FrieslandCampina, Red Bull, Spa, AB InBev & Heineken
• 2000 active users

Working with our SaaS, insights & mobile solutions
• 1.2M outlets

Restaurants, bars, hotels, bakeries, supermarkets, kiosks, hospitals
• 5000 mutations per day

Out of business, starters, ownership change, name change, etc
• 30 students

Manually checking many changes
https://www.pexels.com/photo/people-on-dining-chair-inside-the-glass-wall-room-230736/
D AT L I N Q C O L L E C T S D ATA
W H A T D O W E D O ?
• Outlet features

Address, segmentation, opening hours, contact info, lat/long, terrace, parking, pricing
• Outlet surroundings
Nearby public transportation, school, going out area, shopping, high traffic
• Demographics
Median income, age, density, marital status
• Social Media

Presence, traffic, likes, posts
http://www.wandeltourrotterdam.nl/wp-content/uploads/photo-gallery/1506-Rotterdam-Image-Bank.jpg
D AT L I N Q ’ S M I S S I O N
W H Y D O W E D O I T ?
• Food service market transparent
• Effective marketing & sales
• Which food products are relevant where?
http://frietwinkel.nl/wp-content/uploads/2014/08/Frietwinkel-openingsfoto.jpg
W H Y S PA R K ?
S O L U T I O N
• Scaling is hard

Mo’ outlets, mo’ students
• Data != information
One-of analysis projects should be done continuously and immediately
• Work smarter not harder
Have students train AI instead of repetitive work
• Faster results

Distribute computing allows for increased processing speeds
https://s-media-cache-ak0.pinimg.com/originals/7e/63/11/7e6311d96eb4bb28a41ea34378de4182.jpg
SOLUTION?
H O W T O S TA R T
S O L U T I O N
• Focus on MVP

Minimal Viable Product that acts like a Proof of Concept
• Declare riskiest assumption
We can automatically match data from different sources
• Prove it
Small scale ML models
http://realcorfu.com/wp-content/uploads/2013/09/kiosk3.jpg
BUILDING THE PIPELINE
U S E T H E C L O U D
B U I L D I N G T H E P I P E L I N E
• Flexible usage

Pay for what you use
• Scaling out of the box
The only limit seems to be your credit card
• CLI > UI
CLI tools are your friend. UI is limited
https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
D E V E L O P L O C A L LY
B U I L D I N G T H E P I P E L I N E
• Check versions

Your cloud provider sets the max version (e.g. Spark 2.0.2 instead of 2.1.0)
• Same code smaller data
Use samples of your data to test on
• Use at least 2 threads
Otherwise parallel execution cannot be truly tested
• Exclude libraries from deployment
Some libraries are provided on the cloud, don’t add them to your final build
https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
C O N T I N U O U S I N T E G R AT I O N
B U I L D I N G T H E P I P E L I N E
• Deploy code

Manual uploads or deployments is asking for trouble.

Use polling instead of pushing for security

Create pipeline script to stage builds

Automated tests
• Commit config
Store CI integration
S PA R K J O B S
B U I L D I N G T H E P I P E L I N E
• All jobs in one project

Unless there is a very good reason, keep all your Jobs in one file
• Use Datasets as much as possible
Avoid RDD’s and DataFrames
• Use Case Class (encoders) as much as possible
Generic Row object will only get you so far
• Verbosity only in main class
Log messages from nodes won’t make it to the master
•Look at the execution plan
Remove temp files, clean the slate
PA R Q U E T
B U I L D I N G T H E P I P E L I N E
• What is it

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
• Compressed
Small footprint.
• Columnar
Easier to drill down into data based on fields
• Metadata
Stores meta information easy to use by spark (counts, etc)
• Partitioning
Allows for partitioning on values for faster access
E L A S T I C S E A R C H
B U I L D I N G T H E P I P E L I N E
• NxN matching is computationally expensive

Querying Elasticsearch / Lucene / Solr less
Fuzzy searching using n-grams and other tokenizers
• Cluster out of the box
Auto replication
• Config differs from normal use
It’s not a search feature for a website. 

Don’t care about update performance

Don’t load balance queries (_local)
• Discovery Plugins for GC
So nodes can find each other on local network without Zen
Create VM’s with --scopes=compute-rw
A N S I B L E
B U I L D I N G T H E P I P E L I N E
• Automate VM’s / agent less

Elasticsearch cluster
• Dependant on Ansible GC Modules

gcloud keeps updating. Ansible GC doesn’t directly follow suite
A I R F L O W
B U I L D I N G T H E P I P E L I N E
• Create workflows

Define all tasks and dependencies
• Cron on steroids
Starts every day
• Dependencies
Tasks are dependent on tasks upstream
• Backfill / rerun
Run as if in the past
BUILDING THE PIPELINE
F I N A L LY
C O N C L U S I O N
• What’s next?

Improving MVP

Adding more machine learning

Adding more datasources

Adding more countries

Improving architecture
• Production
Create an API for use in our products

Create UI for exploring the data

Exporting to standard databases to further use
• Team
Hire Data Engineer(s)
Hire Data Scientist(s)
https://static.pexels.com/photos/14107/pexels-photo-14107.jpeg
T H A N K S F O R WAT C H I N G
C O N C L U S I O N
• Questions?

AMA
• Vacancies!
Ask me or check our site http://www.datlinq.com/en/vacancies
• Break
Grab a drink an let’s reconvene after a 15 min break

More Related Content

What's hot

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | GrafanaStreaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | GrafanaInfluxData
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with LuigiTeemu Kurppa
 
The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016Cloud Native Day Tel Aviv
 
Scalable Eventing Over Apache Mesos
Scalable Eventing Over Apache MesosScalable Eventing Over Apache Mesos
Scalable Eventing Over Apache MesosOlivier Paugam
 
Workers and Worker Patterns at Scale
Workers and Worker Patterns at ScaleWorkers and Worker Patterns at Scale
Workers and Worker Patterns at ScaleChad Arimura
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...Itai Yaffe
 
Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013Thoughtworks
 
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds   olx core went aws- Wojciech KrysmannAgile sre fly beyond the clouds   olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds olx core went aws- Wojciech KrysmannPROIDEA
 
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...Thoughtworks
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow ManagementRomi Kuntsman
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetPôle Systematic Paris-Region
 
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...Cloud Native Day Tel Aviv
 
Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning TalkSimon Prickett
 
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | SensuTesting and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | SensuInfluxData
 
Start Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPopStart Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPopJason Plurad
 
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...New Relic
 

What's hot (20)

Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03
 
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | GrafanaStreaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
 
The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016
 
Scalable Eventing Over Apache Mesos
Scalable Eventing Over Apache MesosScalable Eventing Over Apache Mesos
Scalable Eventing Over Apache Mesos
 
Workers and Worker Patterns at Scale
Workers and Worker Patterns at ScaleWorkers and Worker Patterns at Scale
Workers and Worker Patterns at Scale
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013
 
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds   olx core went aws- Wojciech KrysmannAgile sre fly beyond the clouds   olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
 
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Go With The Flow
Go With The FlowGo With The Flow
Go With The Flow
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
 
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
 
Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning Talk
 
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | SensuTesting and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
 
Start Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPopStart Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPop
 
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
 

Viewers also liked

How to write maintainable code
How to write maintainable codeHow to write maintainable code
How to write maintainable codePeter Hilton
 
Data Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceData Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceSrivatsan Ramanujam
 
Transforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent ValueTransforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent ValueTony Ojeda
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for HadoopBruno Aziza
 
Big Data Analytics Principles
Big Data Analytics PrinciplesBig Data Analytics Principles
Big Data Analytics PrinciplesBruno Aziza
 
DataLab DataQuality Dimensions
DataLab DataQuality DimensionsDataLab DataQuality Dimensions
DataLab DataQuality DimensionsCarlos Guerreiro
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs
 
The Laws of Data Science Gravity
The Laws of Data Science GravityThe Laws of Data Science Gravity
The Laws of Data Science GravityBruno Aziza
 
Big Data for the CMO
Big Data for the CMOBig Data for the CMO
Big Data for the CMOBruno Aziza
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 

Viewers also liked (20)

How to write maintainable code
How to write maintainable codeHow to write maintainable code
How to write maintainable code
 
Data Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceData Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data Science
 
Transforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent ValueTransforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent Value
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Big datalab
Big datalabBig datalab
Big datalab
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for Hadoop
 
Big Data Analytics Principles
Big Data Analytics PrinciplesBig Data Analytics Principles
Big Data Analytics Principles
 
DataLab DataQuality Dimensions
DataLab DataQuality DimensionsDataLab DataQuality Dimensions
DataLab DataQuality Dimensions
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
 
The Laws of Data Science Gravity
The Laws of Data Science GravityThe Laws of Data Science Gravity
The Laws of Data Science Gravity
 
Big Data for the CMO
Big Data for the CMOBig Data for the CMO
Big Data for the CMO
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
From Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive AnalyticsFrom Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive Analytics
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 

Similar to Building a Data Ingestion & Processing Pipeline with Spark & Airflow

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...Databricks
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsNicolas (Nick) Barcet
 
The 12 Factor App
The 12 Factor AppThe 12 Factor App
The 12 Factor Apprudiyardley
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorchgeetachauhan
 
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter GildersleeveOSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter GildersleeveNETWAYS
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet CodeDavid Danzilio
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Productioniguazio
 
So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What? So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What? Tesora
 
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerThe NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerNRB
 
How to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven OrganizationHow to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven OrganizationWarrenCruz3
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Metrics 4 faster feedback
Metrics 4 faster feedbackMetrics 4 faster feedback
Metrics 4 faster feedbackKris Buytaert
 
Migrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the CloudMigrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the CloudRoger Valade
 
Seeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of MicroservicesSeeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of MicroservicesDave McAllister
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesRob Winters
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
 
Tools and best practices for sustainable software
Tools and best practices for sustainable softwareTools and best practices for sustainable software
Tools and best practices for sustainable softwareGreen Software Development
 

Similar to Building a Data Ingestion & Processing Pipeline with Spark & Airflow (20)

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOps
 
The 12 Factor App
The 12 Factor AppThe 12 Factor App
The 12 Factor App
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter GildersleeveOSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What? So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What?
 
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerThe NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
 
How to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven OrganizationHow to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven Organization
 
The domino maze
The domino mazeThe domino maze
The domino maze
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Metrics 4 faster feedback
Metrics 4 faster feedbackMetrics 4 faster feedback
Metrics 4 faster feedback
 
Migrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the CloudMigrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the Cloud
 
Devops For Drupal
Devops  For DrupalDevops  For Drupal
Devops For Drupal
 
Seeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of MicroservicesSeeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of Microservices
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
 
Deployments in one click!
Deployments in one click!Deployments in one click!
Deployments in one click!
 
Tools and best practices for sustainable software
Tools and best practices for sustainable softwareTools and best practices for sustainable software
Tools and best practices for sustainable software
 

Recently uploaded

Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Piping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringPiping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringJuanCarlosMorales19600
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 

Recently uploaded (20)

Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Piping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringPiping Basic stress analysis by engineering
Piping Basic stress analysis by engineering
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 

Building a Data Ingestion & Processing Pipeline with Spark & Airflow

  • 1. B U I L D I N G A D ATA I N G E S T I O N & P R O C E S S I N G P I P E L I N E W I T H S PA R K & A I R F L O W D A TA D R I V E N R I J N M O N D Tom Lous @tomloushttps://cdn.mg.co.za/crop/content/images/2013/07/18/Coffee_Machine_-_Truth_-_Cape_Town_DH_4271_i2e.jpg/1680x1080/
  • 2. D AT L I N Q B Y N U M B E R S W H O A R E W E ? • 14 countries
 Netherlands, Belgium, France, UK, Germany, Spain, Italy, Luxembourg • 100 customers across Europe
 Coca-Cola, Unilever, FrieslandCampina, Red Bull, Spa, AB InBev & Heineken • 2000 active users
 Working with our SaaS, insights & mobile solutions • 1.2M outlets
 Restaurants, bars, hotels, bakeries, supermarkets, kiosks, hospitals • 5000 mutations per day
 Out of business, starters, ownership change, name change, etc • 30 students
 Manually checking many changes https://www.pexels.com/photo/people-on-dining-chair-inside-the-glass-wall-room-230736/
  • 3. D AT L I N Q C O L L E C T S D ATA W H A T D O W E D O ? • Outlet features
 Address, segmentation, opening hours, contact info, lat/long, terrace, parking, pricing • Outlet surroundings Nearby public transportation, school, going out area, shopping, high traffic • Demographics Median income, age, density, marital status • Social Media
 Presence, traffic, likes, posts http://www.wandeltourrotterdam.nl/wp-content/uploads/photo-gallery/1506-Rotterdam-Image-Bank.jpg
  • 4. D AT L I N Q ’ S M I S S I O N W H Y D O W E D O I T ? • Food service market transparent • Effective marketing & sales • Which food products are relevant where? http://frietwinkel.nl/wp-content/uploads/2014/08/Frietwinkel-openingsfoto.jpg
  • 5. W H Y S PA R K ? S O L U T I O N • Scaling is hard
 Mo’ outlets, mo’ students • Data != information One-of analysis projects should be done continuously and immediately • Work smarter not harder Have students train AI instead of repetitive work • Faster results
 Distribute computing allows for increased processing speeds https://s-media-cache-ak0.pinimg.com/originals/7e/63/11/7e6311d96eb4bb28a41ea34378de4182.jpg SOLUTION?
  • 6. H O W T O S TA R T S O L U T I O N • Focus on MVP
 Minimal Viable Product that acts like a Proof of Concept • Declare riskiest assumption We can automatically match data from different sources • Prove it Small scale ML models http://realcorfu.com/wp-content/uploads/2013/09/kiosk3.jpg
  • 8. U S E T H E C L O U D B U I L D I N G T H E P I P E L I N E • Flexible usage
 Pay for what you use • Scaling out of the box The only limit seems to be your credit card • CLI > UI CLI tools are your friend. UI is limited https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
  • 9. D E V E L O P L O C A L LY B U I L D I N G T H E P I P E L I N E • Check versions
 Your cloud provider sets the max version (e.g. Spark 2.0.2 instead of 2.1.0) • Same code smaller data Use samples of your data to test on • Use at least 2 threads Otherwise parallel execution cannot be truly tested • Exclude libraries from deployment Some libraries are provided on the cloud, don’t add them to your final build https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
  • 10. C O N T I N U O U S I N T E G R AT I O N B U I L D I N G T H E P I P E L I N E • Deploy code
 Manual uploads or deployments is asking for trouble.
 Use polling instead of pushing for security
 Create pipeline script to stage builds
 Automated tests • Commit config Store CI integration
  • 11. S PA R K J O B S B U I L D I N G T H E P I P E L I N E • All jobs in one project
 Unless there is a very good reason, keep all your Jobs in one file • Use Datasets as much as possible Avoid RDD’s and DataFrames • Use Case Class (encoders) as much as possible Generic Row object will only get you so far • Verbosity only in main class Log messages from nodes won’t make it to the master •Look at the execution plan Remove temp files, clean the slate
  • 12. PA R Q U E T B U I L D I N G T H E P I P E L I N E • What is it
 Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem • Compressed Small footprint. • Columnar Easier to drill down into data based on fields • Metadata Stores meta information easy to use by spark (counts, etc) • Partitioning Allows for partitioning on values for faster access
  • 13. E L A S T I C S E A R C H B U I L D I N G T H E P I P E L I N E • NxN matching is computationally expensive
 Querying Elasticsearch / Lucene / Solr less Fuzzy searching using n-grams and other tokenizers • Cluster out of the box Auto replication • Config differs from normal use It’s not a search feature for a website. 
 Don’t care about update performance
 Don’t load balance queries (_local) • Discovery Plugins for GC So nodes can find each other on local network without Zen Create VM’s with --scopes=compute-rw
  • 14. A N S I B L E B U I L D I N G T H E P I P E L I N E • Automate VM’s / agent less
 Elasticsearch cluster • Dependant on Ansible GC Modules
 gcloud keeps updating. Ansible GC doesn’t directly follow suite
  • 15. A I R F L O W B U I L D I N G T H E P I P E L I N E • Create workflows
 Define all tasks and dependencies • Cron on steroids Starts every day • Dependencies Tasks are dependent on tasks upstream • Backfill / rerun Run as if in the past
  • 17. F I N A L LY C O N C L U S I O N • What’s next?
 Improving MVP
 Adding more machine learning
 Adding more datasources
 Adding more countries
 Improving architecture • Production Create an API for use in our products
 Create UI for exploring the data
 Exporting to standard databases to further use • Team Hire Data Engineer(s) Hire Data Scientist(s) https://static.pexels.com/photos/14107/pexels-photo-14107.jpeg
  • 18. T H A N K S F O R WAT C H I N G C O N C L U S I O N • Questions?
 AMA • Vacancies! Ask me or check our site http://www.datlinq.com/en/vacancies • Break Grab a drink an let’s reconvene after a 15 min break