PyconJP
2016-09-22
Fabian Dubois
Building a data preparation pipeline with Pandas and AWS Lambda
What Will You Learn?
▸ What data preparation is and why it is required.
▸ How to prepare data with pandas.
▸ How to set up a pipeline with AWS Lambda
About Me
▸ Based in Tokyo
▸ Using python with data for 6 years
▸ Freelance Data Products Developer and Consultant (data visualization, machine learning)
▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization)
▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction.
Why Data Preparation?
So you have got data, now what?
▸ Showing it to an audience:
▸ a report from a survey?
▸ a news article with charts?
▸ a sales dashboard?
But a lot of available data is messy
▸ incomplete or missing data
▸ mis-formatted, mis-typed data
▸ wrong / corrupted values
It has every reason to be messy
▸ non-availability
▸ no appropriate means of collection
▸ lack of validation
▸ human errors
And this can have very bad consequences
▸ Crash in your report generator
▸ incomplete reports
▸ report reaches wrong conclusions
▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
It is not just about quality (ETL)
▸ Enriching the data
▸ Aggregating
▸ Classification (ML)
▸ Predictions (ML)
[Diagram: input 1 and input 2 are each cleaned, then combined (aggregate, classify, …) into an output that is visualized]
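The flow in the diagram can be sketched with pandas. This is only an illustration: the two input tables, the `city` and `sales` columns, and the `clean` helper are assumptions made up for the example, not data from the talk.

```python
import pandas as pd

# Two hypothetical inputs with the same columns but different quality issues.
input_1 = pd.DataFrame({"city": ["Tokyo", "tokyo ", "Osaka"], "sales": [10, 5, 7]})
input_2 = pd.DataFrame({"city": ["Osaka", None, "Tokyo"], "sales": [3, 4, 2]})


def clean(df):
    """Drop rows without a city and normalise the city labels."""
    df = df.dropna(subset=["city"]).copy()
    df["city"] = df["city"].str.strip().str.title()
    return df


# Clean each input separately, then combine and aggregate.
combined = pd.concat([clean(input_1), clean(input_2)], ignore_index=True)
output = combined.groupby("city")["sales"].sum()
print(output)
```

Cleaning before aggregating matters here: without it, "tokyo " and "Tokyo" would end up in separate groups.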
Example: data journalism & interactive visualization
▸ Often manually gathered data in spreadsheets
▸ Data cleaning required
▸ Data aggregation/preprocessing required
▸ Data may be updated on a weekly basis
If it is a product, it needs to deal with data updates
[Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
▸ Who is going to run the script?
▸ Needs to be automated (the pipeline)
What does it apply to?
[Chart: applications and solutions placed by data update frequency (once, monthly, daily, real-time) and data quality (low to high)]
▸ Applications: ad hoc data analysis; data journalism; interactive reports, email reports; dashboards, data products
▸ Solutions: Jupyter notebook prototype; automated preparation pipeline (batch); micro-batch or real-time processing pipeline
▸ Part of the chart is marked "our focus"
How to prepare data?
common operations
▸ Date parsing
▸ Deciding on a strategy for null or non parseable values
▸ Enforce value ranges
▸ Sanitise strings
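A minimal pandas sketch of these four operations follows. The sample rows, the column names, and the 0 to 120 age range are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw input: every column arrives as text.
raw = pd.DataFrame({
    "date": ["2016-09-22", "not a date", "2016-09-23"],
    "age": ["34", "abc", "250"],
    "name": ["  alice ", "BOB", "Carol"],
})

df = raw.copy()
# Date parsing: unparseable values become NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
# Null strategy: coerce non-numeric ages to NaN, to be dropped or filled later.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
# Enforce value ranges: treat ages outside [0, 120] as missing.
df["age"] = df["age"].where(df["age"].between(0, 120))
# Sanitise strings: trim whitespace and normalise case.
df["name"] = df["name"].str.strip().str.title()
print(df)
```

Coercing to NaT/NaN first keeps the decision about what to do with bad values (drop, fill, report) separate from the parsing itself.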
Existing tools
▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
▸ great tools to check data quality and define transformations
So why custom solutions with Python and Pandas?
▸ With python, you can do anything!
▸ It is not that difficult
▸ Pandas is a versatile tool that manipulates DataFrames
▸ Easy to specify transformations
▸ Not limited to Pandas: the whole Python ecosystem is available, like scikit-learn
Example from a Jupyter notebook
▸ load a simple file with a list of names and ages of different people
Example: statistics on groups (names)
▸ Is there a relationship between name length and median age?
▸ Chain operations
▸ Plot the length of name vs age for each name
[Figure annotations: Warning, Outlier]
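The notebook cells are not part of this transcript, so the following is a sketch of what such a chained computation might look like. The inline CSV stands in for the input file, and its contents are invented.

```python
import io

import pandas as pd

# Hypothetical stand-in for the notebook's input file: one row per person.
csv_text = """name,age
Ann,30
Ann,40
Robert,20
Robert,24
Robert,70
"""
people = pd.read_csv(io.StringIO(csv_text))

# Chain operations: median age per name, then add the name length.
stats = (
    people.groupby("name")["age"]
    .median()
    .reset_index(name="median_age")
    .assign(name_length=lambda d: d["name"].str.len())
)
print(stats)
# Plotting is omitted; with matplotlib installed it could be
# stats.plot.scatter(x="name_length", y="median_age").
```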
Something is wrong
▸ null values
▸ label issues
Let’s fix this
▸ deal with missing values with `dropna` or `fillna`
▸ clean names
▸ reject outliers
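One way these three fixes could be combined is sketched below. The sample data, the choice of the median as fill value, and the 0 to 120 age bound are assumptions, not the talk's actual notebook code.

```python
import numpy as np
import pandas as pd

# Hypothetical data showing all three problems.
people = pd.DataFrame({
    "name": ["Ann", " ann", "Bob", None, "Bob"],
    "age": [30, 32, np.nan, 41, 400],
})

cleaned = (
    people
    .dropna(subset=["name"])  # rows without a name cannot be grouped
    .assign(name=lambda d: d["name"].str.strip().str.title())  # clean labels
)
# Fill missing ages with the overall median (one possible strategy).
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
# Reject outliers: keep only plausible ages (the 0-120 bound is an assumption).
cleaned = cleaned[cleaned["age"].between(0, 120)]
print(cleaned)
```

Whether to drop or fill depends on the column: a row without a name is useless for per-name statistics, while a missing age can sometimes be imputed.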
Close the loop to improve the data entry/acquisition
▸ Many errors can be avoided during data collection:
▸ form / column validation
▸ drop down selections for categories
▸ Report rejected rows to improve collection process
[Diagram: data → preparation script → list of issues → improve forms…]
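A sketch of how a preparation script might produce such a list of issues. The rule names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical input rows, two of which break a rule.
rows = pd.DataFrame({
    "name": ["Ann", None, "Bob"],
    "age": [30, 25, 400],
})

# Build one boolean column per rule; the rule names are illustrative.
issues = pd.DataFrame({
    "missing_name": rows["name"].isna(),
    "age_out_of_range": ~rows["age"].between(0, 120),
})
rejected_mask = issues.any(axis=1)

accepted = rows[~rejected_mask]
# List of issues to send back to whoever maintains the data entry form.
report = [
    {"row": int(idx), "issues": [rule for rule, failed in flags.items() if failed]}
    for idx, flags in issues[rejected_mask].iterrows()
]
print(report)
```

Keeping the reason for each rejection, rather than silently dropping rows, is what makes it possible to improve the collection process.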
Testing your preparation
▸ Unit tests
▸ Test for anticipated edge cases (defensive programming)
▸ Property based testing (http://hypothesis.works/)
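A small sketch of what such tests might look like for a single cleaning step. The `clean_name` function is hypothetical, and the tests are plain functions that a runner such as pytest could collect. The idempotence check uses hand-picked samples; a property-based testing library could generate them instead.

```python
def clean_name(value):
    """Hypothetical cleaning step: trim whitespace and normalise case."""
    if value is None:
        return None
    return value.strip().title()


def test_trims_and_normalises_case():
    assert clean_name("  alice ") == "Alice"


def test_keeps_missing_values():
    # Anticipated edge case: missing input must not crash the step.
    assert clean_name(None) is None


def test_is_idempotent():
    # Property: cleaning an already clean value changes nothing.
    for sample in ["bob", " CAROL", "dave smith", ""]:
        once = clean_name(sample)
        assert clean_name(once) == once
```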
More references for data cleaning
▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
Setting up a pipeline with AWS Lambda
Some challenges
▸ Don’t let users run scripts
▸ Automating is part of a quality process
▸ Keeping things simple…
▸ and cheap
What is AWS Lambda: a serverless solution
▸ Serverless offering by AWS
▸ No lifecycle to manage or shared state => resilient
▸ Auto-scaling
▸ Pay for actual running time: low cost
▸ No server or infra management: reduced dev / devops cost
[Diagram: events → lambda function → output]
Creating a function
just a python function
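The slide's screenshot is not part of this transcript. As a sketch, a Python Lambda function is a module-level function taking `event` and `context`. The `names` payload key and the cleaning logic below are made up for illustration.

```python
def lambda_handler(event, context):
    """Entry point: Lambda passes the trigger payload as `event`."""
    names = event.get("names", [])  # "names" is a hypothetical payload key
    cleaned = [n.strip().title() for n in names if n and n.strip()]
    return {"count": len(cleaned), "names": cleaned}


# Local call with a hand-made event; no context is needed here.
result = lambda_handler({"names": [" alice", "BOB", "", None]}, None)
print(result)
```

Because it is a plain function, it can be called and tested locally before being uploaded.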
Creating a function: options
Creating an “architecture” with triggers
Batch processing at regular intervals
▸ cron scheduling
▸ let your function get some data and process it at regular intervals
An API / webhook
▸ on API call
▸ Can be triggered from a google spreadsheet
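One function can serve both kinds of trigger by inspecting the event. The sketch below assumes the event shapes of CloudWatch scheduled events (a `source` of `aws.events`) and of the API Gateway proxy integration (an `httpMethod` key). `prepare_data` is a hypothetical placeholder for the real work.

```python
import json


def prepare_data(reason):
    """Placeholder for the actual preparation step (hypothetical)."""
    return {"status": "prepared", "triggered_by": reason}


def lambda_handler(event, context):
    # Scheduled (cron) events from CloudWatch Events carry this source.
    if event.get("source") == "aws.events":
        return prepare_data("schedule")
    # API Gateway proxy events carry the HTTP method; reply in HTTP shape.
    if "httpMethod" in event:
        result = prepare_data("api")
        return {"statusCode": 200, "body": json.dumps(result)}
    return prepare_data("unknown")


# Hand-made events for a local check.
scheduled = lambda_handler({"source": "aws.events"}, None)
api = lambda_handler({"httpMethod": "POST", "body": "{}"}, None)
print(scheduled, api)
```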
Setting up AWS Lambda for Pandas
Pandas and its dependencies need to be compiled for Amazon Linux x86_64:
1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries
shell:
    # install compilation environment
    sudo yum -y update
    sudo yum -y upgrade
    sudo yum groupinstall "Development Tools"
    sudo yum install blas blas-devel lapack \
        lapack-devel Cython --enablerepo=epel
    # create and activate virtual env
    virtualenv pdenv
    source pdenv/bin/activate
    # install pandas
    pip install pandas
    # zip the environment content (quote the pattern so the shell does not expand it)
    cd ~/pdenv/lib/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude "*.pyc"
    cd ~/pdenv/lib64/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude "*.pyc"
    # add the supporting libraries
    cd ~/
    mkdir -p libs
    cp /usr/lib64/liblapack.so.3 \
        /usr/lib64/libblas.so.3 \
        /usr/lib64/libgfortran.so.3 \
        /usr/lib64/libquadmath.so.0 \
        libs/
    zip -r ~/pdenv.zip libs
Using pandas from a lambda function
▸ The lambda process needs to access those binaries
▸ Set up env variables
▸ Call a subprocess
▸ And pickle the function input
▸ AWS will call `lambda_function.lambda_handler`
python: lambda_function.py
    import os
    import subprocess
    import cPickle as pickle  # Python 2.7, the runtime targeted by these slides

    # Directory holding the bundled shared libraries; it must match where they
    # sit in the zip (the packaging step copies them to libs/).
    LIBS = os.path.join(os.getcwd(), 'libs')

    def handler(filename):
        def handle(event, context):
            # pass the event to the subprocess through a pickle file
            with open('/tmp/event.p', 'wb') as f:
                pickle.dump(event, f)
            env = os.environ.copy()
            env.update(LD_LIBRARY_PATH=LIBS)
            proc = subprocess.Popen(
                ('python', filename),
                env=env,
                stdout=subprocess.PIPE)
            # communicate() reads stdout while waiting, so a full pipe cannot block
            out, _ = proc.communicate()
            return out
        return handle

    lambda_handler = handler('my_function.py')
The actual function
▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server
▸ Clean it
▸ Copy it somewhere
python: my_function.py
    import pickle
    from StringIO import StringIO  # Python 2.7

    import pandas as pd
    import requests

    def run():
        # get the lambda call arguments
        with open('/tmp/event.p', 'rb') as f:
            event = pickle.load(f)
        # load some data from a google spreadsheet
        # ({sheet_id} and {page_id} are placeholders to fill in)
        r = requests.get('https://docs.google.com/spreadsheets'
                         + '/d/{sheet_id}/export?format=csv&gid={page_id}')
        data = r.content.decode('utf-8')
        df = pd.read_csv(StringIO(data))
        # Do something
        # save as file
        file_ = StringIO()
        df.to_csv(file_, encoding='utf-8')
        # copy the result somewhere

    if __name__ == '__main__':
        run()
upload and test
▸ add your lambda function code to the environment zip.
▸ upload your function
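Adding the function code to the environment zip can be scripted with the standard library. In this sketch the archive and the two source files are small stand-ins created in a temporary directory; the file names follow the slides, and the upload step itself is not shown.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "pdenv.zip")

# Stand-in for the environment zip built on the EC2 instance.
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("pandas/__init__.py", "# placeholder for the real package\n")

# Stand-ins for the function source files.
sources = {}
for name in ("lambda_function.py", "my_function.py"):
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write("# function code goes here\n")
    sources[name] = path

# Append the function code at the root of the archive.
with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as archive:
    for name, path in sources.items():
        archive.write(path, arcname=name)

with zipfile.ZipFile(zip_path) as archive:
    names = archive.namelist()
print(names)
```

The handler module has to sit at the root of the zip so that `lambda_function.lambda_handler` can be resolved.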
caveat 1: python 2.7
▸ officially, only Python 2.7 is supported (as of this talk, September 2016)
▸ but Python 3 is available and can be called as a subprocess
▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
caveat 2: max process memory (1.5GB) and execution time
▸ need to split the dataset if it is too large
▸ option 1: loop over the pieces within one lambda call (may exceed the timeout)
▸ option 2: map the pieces to multiple lambda calls (need to merge the dataset at the end)
▸ Lambda functions should be simple; chain them if required
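The split, map and merge idea can be sketched locally with pandas. Here `process_chunk` stands in for one Lambda invocation; the chunk size and the `age` column are arbitrary choices for the example.

```python
import pandas as pd


def split(df, chunk_size):
    """Yield consecutive row slices of at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


def process_chunk(chunk):
    """Stand-in for one Lambda invocation working on one chunk."""
    return chunk.assign(age_months=chunk["age"] * 12)


data = pd.DataFrame({"age": [1, 2, 3, 4, 5]})
# Map: each chunk would go to its own Lambda call in a real pipeline.
parts = [process_chunk(chunk) for chunk in split(data, chunk_size=2)]
# Merge the partial results back into one dataset.
merged = pd.concat(parts)
print(merged)
```

This only works unchanged when rows can be processed independently; aggregations across chunks need an extra combine step after the merge.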
Takeaways
▸ Know your data and your target
▸ Pandas can solve many issues
▸ Defensive programming and closing the loop
▸ AWS Lambda is a powerful and flexible tool for time and resource constrained teams
Thanks
Questions?
@fabian_dubois
fabian@datamaplab.com
check denryoku.io
