DATA SCIENCE MEETS SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are & how they work
Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
Programming languages
• Mostly Python, incl. frameworks:
• NumPy
• Pandas
• scikit-learn
• SQL
• R
Development environments
• IPython Notebook
Development environments
• Dataiku
Machine Learning
• Algorithms:
• Logistic Regression
• Decision trees
• Random forests
• Implementations:
• Dataiku
• scikit-learn
• Vowpal Wabbit
Programmers
Who they are & how they work
http://xkcd.com/378/
Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
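The dataset split in step 3.1 can be sketched as follows. This is a minimal illustration using only the standard library; the 60/20/20 ratios, the fixed seed and the function name `split_dataset` are assumptions, not values from the project.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training/validation/test lists.

    The ratios and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever is left is the test set
    )


training, validation, test = split_dataset(range(100))
print(len(training), len(validation), len(test))
```

The model is trained on the first list, tuned on the second, and the third is only touched for the final evaluation.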
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
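A minimal sketch of these cleansing steps in plain Python. The column names (`premium`, `is_active`, `birth_date`), the missing-value markers and the date format are hypothetical; the project's actual rules are not shown in the slides.

```python
from datetime import datetime

# Assumed markers for missing values.
MISSING = {"", "NA", "N/A", "null"}


def to_bool(s):
    # Raises KeyError for anything that is not a known boolean spelling.
    return {"true": True, "yes": True, "false": False, "no": False}[s.strip().lower()]


def to_date(s):
    # Assumed input format: YYYY-MM-DD.
    return datetime.strptime(s.strip(), "%Y-%m-%d").date()


def clean_value(raw, parser):
    """Return the parsed value, or None if it is missing or malformed."""
    if raw is None or raw.strip() in MISSING:
        return None
    try:
        return parser(raw)
    except (ValueError, KeyError):
        return None


def clean_row(row):
    # Column names are hypothetical.
    return {
        "premium": clean_value(row.get("premium"), float),
        "is_active": clean_value(row.get("is_active"), to_bool),
        "birth_date": clean_value(row.get("birth_date"), to_date),
    }


raw = {"premium": "120.50", "is_active": "Yes", "birth_date": "1980-13-01"}
cleaned = clean_row(raw)
print(cleaned)
```

Here a malformed value is turned into `None`, exactly like a missing one. Whether to do that, reject the row, or stop the job is a decision to make per column.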
Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
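The transformations listed above can be sketched with the standard library only. Function names are illustrative, and the bag-of-words tokenization (lowercase, split on whitespace) is a simplifying assumption.

```python
from collections import Counter
from datetime import date


def age(birth_date, today):
    """A birth date -> age in full years at `today`."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def number_of_calls(call_dates):
    """Dates of phone calls -> number of calls."""
    return len(call_dates)


def bag_of_words(text):
    """Text -> word counts (naive whitespace tokenization)."""
    return Counter(text.lower().split())


def levenshtein(a, b):
    """2 names -> edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


print(age(date(1980, 6, 15), date(2016, 6, 14)))
print(levenshtein("kitten", "sitting"))
```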
Machine Learning
• Train a model
• Test an algorithm with different parameters
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
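To make "cross validation (grid search)" concrete, here is a dependency-free sketch. A tiny k-nearest-neighbours classifier stands in for the algorithms named above purely to keep the example short; the mock data, the fold count and the parameter grid are all assumptions. In practice a library implementation such as scikit-learn's `GridSearchCV` would normally be used instead.

```python
from statistics import mean


def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def cross_val_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN over `folds` train/validation splits."""
    scores = []
    for f in range(folds):
        # Every folds-th row goes to validation, the rest to training.
        validation = data[f::folds]
        training = [p for i, p in enumerate(data) if i % folds != f]
        hits = sum(knn_predict(training, x, k) == y for x, y in validation)
        scores.append(hits / len(validation))
    return mean(scores)


def grid_search(data, grid):
    """Return (best parameter, cross-validated score of every parameter)."""
    scores = {k: cross_val_accuracy(data, k) for k in grid}
    return max(scores, key=scores.get), scores


# Mock data: one feature, label is 1 when the feature exceeds 10.
data = [(x, int(x > 10)) for x in range(21)]
best_k, scores = grid_search(data, grid=[1, 3, 5])
print(best_k, scores)
```

Comparing different algorithms follows the same pattern: compute the cross-validated score of each candidate and keep the best one.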
Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
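A short sketch of two of these checks: RMSE for numeric predictions, and counts of false positives and false negatives for binary predictions. The sample labels are made up for illustration.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def confusion_counts(actual, predicted):
    """Count true/false positives/negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts


actual = [1, 0, 1, 1, 0]
predicted = [1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Listing the individual rows behind `fp` and `fn` is often more instructive than the counts themselves.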
Industrialization
Cookbook
Disclaimer
• Context of this project:
• Not So Big Data (but Smart Data)
• No real-time workflows (yet?)
Distribute the processing
RECIPE #1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
Use a centralized data store
RECIPE #2
Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
Rationalize the use of programming languages
RECIPE #3
Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely used languages
• Spark in Python/Scala
• R support in Spark is still too young
• Provide assistance to ease the adoption!
Use an IDE
RECIPE #4
Use an IDE
• Notebooks:
• Powerful for exploratory work
• Weak for code editing and code structuring
• Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm
• Code compilation
• Refactoring
• Execution of unit tests
• Support for Git
Source Control
RECIPE #5
Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
RECIPE #6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
• Scala → Maven
• Python → Setuptools
• Packaging:
• Scala → Jar (Maven Shade plugin)
• Python → Egg (Setuptools)
• Compatible with spark-submit
RECIPE #7
Secure the build process
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
Automate the process
RECIPE #8
Automate the process
• Data is loaded manually in HDFS:
• CSV files, sometimes compressed
• Often received by email
• Often samples
Automate the process
• No human intervention should be required
• All steps should be code / tools
• E.g. automate file transfers, unzipping…
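As an illustration of automating the file handling, the sketch below moves incoming CSV files from an inbox directory to a landing directory and decompresses gzip files on the way. The directory layout and the `ingest` function are hypothetical, and a local directory stands in for HDFS: the actual upload step (for example with `hdfs dfs -put`) is left out.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def ingest(inbox, landing):
    """Move CSV files from inbox to landing, decompressing .csv.gz files.

    Returns the list of files written to the landing directory.
    """
    inbox, landing = Path(inbox), Path(landing)
    landing.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(inbox.iterdir()):
        if source.name.endswith(".csv.gz"):
            target = landing / source.name[:-len(".gz")]
            with gzip.open(source, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            source.unlink()  # the compressed original is no longer needed
        elif source.suffix == ".csv":
            target = landing / source.name
            shutil.move(str(source), str(target))
        else:
            continue  # leave unknown files in place for inspection
        written.append(target)
    return written


# Demo on a throwaway directory with one plain and one compressed file.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp, "inbox")
    inbox.mkdir()
    (inbox / "a.csv").write_text("id\n1\n")
    with gzip.open(inbox / "b.csv.gz", "wt") as f:
        f.write("id\n2\n")
    names = [p.name for p in ingest(inbox, Path(tmp, "landing"))]
    content = Path(tmp, "landing", "b.csv").read_text()

print(names)
```

Run from a scheduler, a step like this removes the manual unzip-and-copy work described above.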
Adapt to living data
RECIPE #9
Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
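Thetis is an internal tool and its API is not shown here. The following is only a generic sketch of a preliminary checking step: each incoming row is validated, and invalid rows are set aside with the reasons for rejection. Column names, formats and rules are assumptions.

```python
from datetime import datetime


def check_row(row, expected_columns):
    """Return the list of problems found in one row (empty list = valid)."""
    problems = []
    for column in expected_columns:
        if row.get(column) in (None, ""):
            problems.append(f"missing {column}")
    if row.get("amount"):
        try:
            float(row["amount"])
        except ValueError:
            problems.append("amount is not a number")
    if row.get("date"):
        try:
            datetime.strptime(row["date"], "%Y-%m-%d")
        except ValueError:
            problems.append("date is not YYYY-MM-DD")
    return problems


def check_rows(rows, expected_columns=("id", "amount", "date")):
    """Split rows into valid rows and (row, problems) rejects."""
    valid, rejected = [], []
    for row in rows:
        problems = check_row(row, expected_columns)
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"id": "1", "amount": "10.5", "date": "2016-01-31"},
    {"id": "2", "amount": "10,5", "date": "31/01/2016"},  # wrong formats
    {"id": "3", "amount": "", "date": "2016-02-01"},      # missing value
]
valid, rejected = check_rows(rows)
for row, problems in rejected:
    print(row["id"], problems)
```

Keeping the rejects and their reasons makes the "decide what to do with invalid data" step explicit: they can be dropped, repaired, or sent back to the data provider.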
Provide a library of transformations
RECIPE #10
Library of transformations
• Dataiku "shakers":
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
• Unit tests
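The idea "Transformation = DataFrame → DataFrame" can be sketched without any dependency by letting a list of dicts stand in for a DataFrame. That stand-in, the function names and the `cp` column are assumptions for illustration; with Spark the same functions would take and return Spark DataFrames.

```python
from functools import reduce


def rename_column(old, new):
    """Build a transformation that renames one column."""
    def transform(frame):
        return [{(new if k == old else k): v for k, v in row.items()} for row in frame]
    return transform


def drop_missing(column):
    """Build a transformation that drops rows where `column` is None."""
    def transform(frame):
        return [row for row in frame if row.get(column) is not None]
    return transform


def pipeline(*transformations):
    """Chain transformations left to right into a single transformation."""
    def transform(frame):
        return reduce(lambda acc, t: t(acc), transformations, frame)
    return transform


# Because every step has the same shape, steps compose freely.
prepare = pipeline(drop_missing("cp"), rename_column("cp", "post_code"))
result = prepare([{"cp": "75001"}, {"cp": None}])
print(result)
```

Since each transformation is a plain function with one input and one output, it can be reused across projects and unit tested in isolation.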
Unit test the data pipeline
RECIPE #11
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
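A minimal `unittest` sketch of this approach: a transformation is run on mock data and the result is compared as a list of lists. The transformation `add_full_name` is hypothetical and works on plain lists; with Spark, the lists would typically be obtained by collecting the resulting DataFrame.

```python
import unittest


def add_full_name(rows):
    """Transformation under test (hypothetical): append 'first last' to each row."""
    return [row + [f"{row[0]} {row[1]}"] for row in rows]


class AddFullNameTest(unittest.TestCase):
    def test_appends_full_name(self):
        mock_data = [["Ada", "Lovelace"], ["Alan", "Turing"]]
        expected = [
            ["Alan", "Turing", "Alan Turing"],
            ["Ada", "Lovelace", "Ada Lovelace"],
        ]
        actual = add_full_name(mock_data)
        # Sort both sides: row order is usually not guaranteed
        # when the data comes back from a distributed job.
        self.assertEqual(sorted(actual), sorted(expected))


# Run the test case explicitly so the snippet works in any context.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddFullNameTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```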
Assemble the workflow
RECIPE #12
Assemble the workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark
Assemble the workflow
• Oozie:
• Spark
• Map-Reduce
• Shell
• …
• Scheduling
• Alerts
• Logs
Summary & Conclusion
Summary
• Keys:
• Use industrialization-ready tools
• Pair Programming: Data Scientist + Developer
• Success criteria:
• Lower time to market
• Higher processing speed
• More robust processes
Thank you!
@aseigneurin - @ipponusa

Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Data Science meets Software Development

  • 1. DATA SCIENCE MEETS SOFTWARE DEVELOPMENT Alexis Seigneurin - Ippon Technologies
  • 2. Who I am • Software engineer for 15 years • Consultant at IpponTech in Paris, France • Favorite subjects: Spark, Cassandra, Ansible, Docker • @aseigneurin
  • 3. • 200 software engineers in France and the US • In the US: offices in DC, NYC and Richmond, Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster, Tatami, etc. • @ipponusa
  • 4. The project • Data Innovation Lab of a large insurance company • Data → Business value • Team of 30 Data Scientists + Software Developers
  • 5. Data Scientists Who they are & How they work
  • 6. Skill set of a Data Scientist • Strong in: • Science (maths / statistics) • Machine Learning • Analyzing data • Good / average in: • Programming • Not good in: • Software engineering
  • 7. Programming languages • Mostly Python, incl. frameworks: • NumPy • Pandas • scikit-learn • SQL • R
  • 10. Machine Learning • Algorithms: • Logistic Regression • Decision trees • Random forests • Implementations: • Dataiku • scikit-learn • Vowpal Wabbit
  • 11. Programmers Who they are & How they work http://xkcd.com/378/
  • 12. Skill set of a Developer • Strong in: • Software engineering • Programming • Good / average in: • Science (maths / statistics) • Analyzing data • Not good in: • Machine Learning
  • 13. How Developers work • Programming languages • Java • Scala • Development environment • Eclipse • IntelliJ IDEA • Toolbox • Maven • …
  • 14. A typical Data Science project In the Lab
  • 15. Workflow 1. Data Cleansing 2. Feature Engineering 3. Train a Machine Learning model 1. Split the dataset: training/validation/test datasets 2. Train the model 4. Apply the model on new data
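
The dataset split in step 3 can be sketched in plain Python. This is a minimal illustration, not the project's code: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are all assumptions made for the example.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows, then cut them into training/validation/test lists.

    The proportions and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever is left is the test set
    )


train_set, validation_set, test_set = split_dataset(range(10))
```

Fixing the seed keeps the three subsets stable, which makes it easier to compare models trained on different days.
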
  • 16. Data Cleansing • Convert strings to numbers/booleans/… • Parse dates • Handle missing values • Handle data in an incorrect format • …
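
A minimal sketch of these cleansing steps, using only the Python standard library. The field names (`premium`, `is_active`, `birth_date`), the accepted boolean spellings and the date format are hypothetical; the choice to map every missing or malformed value to `None` is one possible policy among several.

```python
from datetime import datetime

# Spellings accepted as booleans (an assumption for this example).
_BOOLEANS = {"true": True, "yes": True, "1": True,
             "false": False, "no": False, "0": False}


def to_float(value):
    """Convert a string to a float; None when missing or malformed."""
    try:
        # Accept a decimal comma, which is common in French CSV exports.
        return float(value.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None


def to_bool(value):
    """Convert a string to a boolean; None when missing or unknown."""
    if not isinstance(value, str):
        return None
    return _BOOLEANS.get(value.strip().lower())


def to_date(value, fmt="%Y-%m-%d"):
    """Parse a date string; None when missing or in an incorrect format."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except (AttributeError, ValueError):
        return None


def clean_record(raw):
    """Clean one raw record (a dict of strings) field by field."""
    return {
        "premium": to_float(raw.get("premium")),
        "is_active": to_bool(raw.get("is_active")),
        "birth_date": to_date(raw.get("birth_date")),
    }
```

Returning `None` instead of raising lets a later step decide whether to drop the record, impute the value or reject the whole file.
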
  • 17. Feature Engineering • Transform data into numerical features • E.g.: • A birth date → age • Dates of phone calls → Number of calls • Text → Vector of words • 2 names → Levenshtein distance
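
Three of these features can be sketched with the standard library alone. The function names are hypothetical, and the word vector is reduced to simple word counts on whitespace-separated tokens, which is an assumption; a real project would more likely rely on a library vectorizer.

```python
from collections import Counter
from datetime import date


def age_on(birth_date, today):
    """Birth date -> age in full years on the given day."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def word_counts(text):
    """Text -> bag of words (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def levenshtein(a, b):
    """Two names -> edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


age = age_on(date(1980, 6, 15), date(2015, 6, 14))
distance = levenshtein("kitten", "sitting")
```
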
  • 18. Machine Learning • Train a model • Test an algorithm with different params • Cross validation (Grid Search) • Compare different algorithms, e.g.: • Logistic regression • Gradient boosting trees • Random forest
  • 19. Machine Learning • Evaluate the accuracy of the model • Root Mean Square Error (RMSE) • ROC curve • … • Examine predictions • False positives, false negatives…
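
As an illustration of these evaluation steps, the sketch below computes the RMSE for a regression model and counts false positives and false negatives for a binary classifier. It is written in plain Python to stay self-contained; the function names are assumptions, and a library such as scikit-learn offers equivalent metrics.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1  # predicted positive, actually negative
        elif not p and a:
            counts["fn"] += 1  # predicted negative, actually positive
        else:
            counts["tn"] += 1
    return counts


error = rmse([0, 0], [2, 2])
counts = confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1])
```
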
  • 21. Disclaimer • Context of this project: • Not So Big Data (but Smart Data) • No real-time workflows (yet?)
  • 23. Distribute the processing • Data Scientists work with data samples • No constraint on processing time • Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
  • 24. Distribute the processing • In production: • H/W resources are constrained • Large data sets to process • Spark: • Included in CDH • DataFrames (Spark 1.3+) ≃ Pandas DataFrames • Fast!
  • 25. Recipe #2: Use a centralized data store
  • 26. Use a centralized data store • Data Scientists store data on their workstations • Limited storage • Data not shared within the team • Data privacy not enforced • Subject to data losses
  • 27. Use a centralized data store • Store data on HDFS: • Hive tables (SQL) • Parquet files • Security: Kerberos + permissions • Redundant + potentially unlimited storage • Easy access from Spark and Dataiku
  • 28. Recipe #3: Rationalize the use of programming languages
  • 29. Programming languages • Data Scientists write code on their workstations • This code may not run in the datacenter • Language variety → Hard to share knowledge
  • 30. Programming languages • Use widespread languages • Spark in Python/Scala • Support for R is too young • Provide assistance to ease the adoption!
  • 31. Recipe #4: Use an IDE
  • 32. Use an IDE • Notebooks: • Powerful for exploratory work • Weak for code editing and code structuring • Inadequate for code versioning
  • 33. Use an IDE • IntelliJ IDEA / PyCharm • Code compilation • Refactoring • Execution of unit tests • Support for Git
  • 34. Recipe #5: Source Control
  • 35. Source Control • Data Scientists work on their workstations • Code is not shared • Code may be lost • Intermediate versions are not preserved • Lack of code review
  • 36. Source Control • Git + GitHub / GitLab • Versioning • Easy to go back to a version running in production • Easy sharing (+permissions) • Code review
  • 37. Recipe #6: Packaging the code
  • 38. Packaging the code • Source code has dependencies • Dependencies in production ≠ at dev time • Assemble the code + its dependencies
  • 39. Packaging the code • Freeze the dependencies: • Scala → Maven • Python → Setuptools • Packaging: • Scala → Jar (Maven Shade plugin) • Python → Egg (Setuptools) • Compliant with spark-submit
  • 40. Recipe #7: Secure the build process
  • 41. Secure the build process • Data Scientists may commit code… without running tests first! • Quality may decrease over time • Packages built by hand on a workstation are not reproducible
  • 42. Secure the build process • Jenkins • Unit test report • Code coverage report • Packaging: Jar / Egg • Dashboard • Notifications (Slack + email)
  • 43. Recipe #8: Automate the process
  • 44. Automate the process • Data is loaded manually in HDFS: • CSV files, sometimes compressed • Often received by email • Often samples
  • 45. Automate the process • No human intervention should be required • All steps should be code / tools • E.g. automate file transfers, unzipping…
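
One such step, unzipping, can be sketched as a small function instead of a manual operation. The directory layout, the `*.csv.gz` naming convention and the function name `decompress_incoming` are assumptions made for the example; loading the result into HDFS is left out.

```python
import gzip
import shutil
from pathlib import Path


def decompress_incoming(incoming_dir, target_dir):
    """Decompress every *.csv.gz file of incoming_dir into target_dir.

    Returns the list of files written, in alphabetical order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(Path(incoming_dir).glob("*.csv.gz")):
        # "calls.csv.gz" becomes "calls.csv" in the target directory.
        destination = target / archive.name[:-len(".gz")]
        with gzip.open(archive, "rb") as source, open(destination, "wb") as sink:
            shutil.copyfileobj(source, sink)
        written.append(destination)
    return written
```

Because the step is plain code, it can be versioned, unit tested and called from a scheduler like any other stage.
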
  • 46. Recipe #9: Adapt to living data
  • 47. Adapt to living data • Data Scientists work with: • Frozen data • Samples • Risks with data received on a regular basis: • Incorrect format (dates, numbers…) • Corrupt data (incl. encoding changes) • Missing values
  • 48. Adapt to living data • Data Checking & Cleansing • Preliminary steps before processing the data • Decide what to do with invalid data • Thetis • Internal tool • Performs most checking & cleansing operations
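
The slides do not describe how Thetis works, so the following is only a generic sketch of a checking step that runs before processing. The field names, the date format and the policy of setting invalid rows aside with the reasons for rejection are all assumptions.

```python
from datetime import datetime


def check_row(row):
    """Return the list of problems found in one record (empty if valid)."""
    problems = []
    if not row.get("customer_id"):
        problems.append("missing customer_id")
    try:
        datetime.strptime(row.get("call_date") or "", "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("bad call_date")
    try:
        float(row.get("amount"))
    except (TypeError, ValueError):
        problems.append("bad amount")
    return problems


def partition(rows):
    """Split rows into valid ones and (row, problems) pairs to set aside."""
    valid, rejected = [], []
    for row in rows:
        problems = check_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected
```

Keeping the rejected rows together with their reasons makes the decision about invalid data explicit: they can be logged, counted or sent back to the data provider.
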
  • 49. Recipe #10: Provide a library of transformations
  • 50. Library of transformations • Dataiku "shakers": • Parse dates • Split a URL (protocol, host, path, …) • Transform a post code into a city / department name • … • Cannot be used outside Dataiku
  • 51. Library of transformations • All transformations should be code • Reuse transformations between projects • Provide a library • Transformation = DataFrame → DataFrame • Unit tests
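
The idea "Transformation = DataFrame → DataFrame" can be sketched as functions with a single shape that compose freely. To stay runnable without Spark, a list of dicts stands in for the DataFrame here; that stand-in and the names `split_url`, `drop_column` and `pipeline` are assumptions for the example, not the project's library.

```python
from functools import reduce
from urllib.parse import urlsplit


def split_url(column):
    """Build a transformation adding protocol/host/path columns for a URL."""
    def transform(rows):
        result = []
        for row in rows:
            parts = urlsplit(row[column])
            result.append({
                **row,
                column + "_protocol": parts.scheme,
                column + "_host": parts.netloc,
                column + "_path": parts.path,
            })
        return result
    return transform


def drop_column(column):
    """Build a transformation removing one column."""
    def transform(rows):
        return [{k: v for k, v in row.items() if k != column} for row in rows]
    return transform


def pipeline(*steps):
    """Chain transformations; the result is itself a transformation."""
    return lambda rows: reduce(lambda data, step: step(data), steps, rows)


prepare = pipeline(split_url("url"), drop_column("url"))
result = prepare([{"id": 1, "url": "https://example.com/docs/index.html"}])
```

Since every step takes rows and returns rows, each one can be unit tested in isolation and reused between projects.
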
  • 52. Recipe #11: Unit test the data pipeline
  • 53. Unit test the data pipeline • Independent data processing steps • Data pipeline not often tested from beginning to end • Data pipeline easily broken
  • 54. Unit test the data pipeline • Unit test each data transformation stage • Scala: ScalaTest • Python: unittest • Use mock data • Compare DataFrames: • No library (yet?) • Compare lists of lists
  • 55. Recipe #12: Assemble the Workflow
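
A minimal sketch of such a test with Python's `unittest`, using mock data and comparing lists of lists. The transformation `add_call_count` is hypothetical and works on plain lists; with Spark, the rows would first have to be collected from the DataFrame and converted to lists before the comparison.

```python
import unittest


def add_call_count(rows):
    """Transformation under test: [customer, call dates] -> [customer, count]."""
    return [[customer, len(call_dates)] for customer, call_dates in rows]


class AddCallCountTest(unittest.TestCase):

    def test_counts_calls_per_customer(self):
        # Mock data: small, hand-written, and covering an empty case.
        mock_rows = [
            ["alice", ["2015-01-02", "2015-01-09"]],
            ["bob", []],
        ]
        expected = [["alice", 2], ["bob", 0]]
        # Compare lists of lists rather than DataFrames.
        self.assertEqual(add_call_count(mock_rows), expected)
```

Saved in a module, the test can be run with `python -m unittest`, which also makes it easy to execute from a build server.
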
  • 56. Assemble the Workflow • Separate transformation processes: • Transformations applied to some data • Results are frozen and used in other processes • Jobs are launched manually • No built-in scheduler in Spark
  • 57. Assemble the workflow • Oozie: • Spark • Map-Reduce • Shell • … • Scheduling • Alerts • Logs
  • 59. Summary • Keys: • Use industrialization-ready tools • Pair Programming: Data Scientist + Developer • Success criteria: • Lower time to market • Higher processing speed • More robust processes