COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK
Christophe Salperwyck – EDF R&D
Simon Maby – OCTO Technology - @simonmaby
Xdata project: www.xdata.fr, grants from
"Investissement d'Avenir" program, 'Big Data' call
| 2
AGENDA
1. PROBLEM DESCRIPTION
2. IMPLEMENTATION
• Courbotree: presentation of the algorithm
• From MLlib to CourboSpark
3. PERFORMANCE
• Configuration (cluster description, Spark config…)
4. FEEDBACK ON SPARK/MLLIB
| 3
FRENCH METERS DATA
| 4
• 1 measure every 10 min
• 35 million customers
• Time-series: 144 points × 365 days
→ Annual data volume: 1800 billion records, 120 TB of raw data
BIG DATA!
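The record count on this slide follows from the sampling rate. The lines below are a quick check, not part of the original deck; the variable names are illustrative, and the bytes-per-record figure is only derived from the slide's numbers, assuming 1 TB = 10^12 bytes.

```python
# Back-of-the-envelope check of the annual data volume quoted on the slide.
customers = 35_000_000
points_per_day = 24 * 60 // 10        # one measure every 10 minutes -> 144
days_per_year = 365

records_per_year = customers * points_per_day * days_per_year
print(f"{records_per_year:,} records per year")   # roughly 1800 billion

# Implied average size of one raw record (assumption: 120 TB = 120 * 10**12 bytes).
bytes_per_record = 120 * 10**12 / records_per_year
print(f"~{bytes_per_record:.0f} bytes per raw record")
```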
| 5
LOAD CURVES CLASSIFICATION
Contract type   Region   …   Equipment type   Load curve
9 kVA           75       …   Elec             (curve)
6 kVA           22       …   Gas              (curve)
…               …        …   …                …
12 kVA          34       …   Elec             (curve)
| 6
WHY A DECISION TREE?
• Easy to understand
• Ability to explore the model
• Ability to choose the expressivity of the model
| 7
SPLIT CRITERIA: INERTIA
Goal: find the groups of curves that differ most, depending on an explanatory feature
How to split? We can either:
• Minimize the dispersion of the curves within each group (intra-class inertia)
or
• Maximize the difference between the average curves of the groups (inter-class inertia)
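A minimal sketch of the two criteria, assuming curves are equal-length lists of floats and that distance is squared Euclidean. The slides do not give the exact formula, so the function names and the unnormalized sums are assumptions. With these definitions, total inertia = intra + inter, so minimizing one is equivalent to maximizing the other.

```python
# Intra- and inter-class inertia of a candidate split over load curves.

def mean_curve(curves):
    n = len(curves)
    return [sum(c[t] for c in curves) / n for t in range(len(curves[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertia(curves):
    """Sum of squared distances of the curves to their mean curve."""
    m = mean_curve(curves)
    return sum(sq_dist(c, m) for c in curves)

def split_inertias(left, right):
    """Return (intra, inter) inertia for a split into two groups of curves."""
    intra = inertia(left) + inertia(right)
    g = mean_curve(left + right)
    inter = (len(left) * sq_dist(mean_curve(left), g)
             + len(right) * sq_dist(mean_curve(right), g))
    return intra, inter

# Toy data: two "electric" and two "gas" curves of three points each.
electric = [[1.0, 4.0, 2.0], [1.2, 4.4, 2.2]]
gas = [[0.5, 0.6, 0.5], [0.7, 0.8, 0.5]]
intra, inter = split_inertias(electric, gas)
total = inertia(electric + gas)
print(intra, inter, total)
```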
| 8
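MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type)
[Figure: mean load curves for the Electrical and Gas groups; x-axis: hour, y-axis: P in W; the split is chosen as ArgMax(d), with d the distance between the two mean curves]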
| 9
EXISTING DISTRIBUTED DECISION TREE
• Scalable Distributed Decision Trees in Spark MLlib
  Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014.
  http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
• A MapReduce Implementation of C4.5 Decision Tree Algorithm
  Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60.
  http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
• PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
  Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009.
  http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
• Distributed Decision Tree Learning for Mining Big Data Streams
  Arinto Murdopo. Master thesis, Yahoo! Labs Barcelona, July 2013.
  http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
| 10
MLLIB DECISION TREE PARALLELIZATION
| 11
HORIZONTAL STRATEGY
Step 1: compute average curves
  Host 1: [0:10[ [10:20[     Host 2: [0:10[ [10:20[     Host 3: [0:10[ [10:20[
Step 2: collect and find the best split
  Host 1: [0:10[ [10:20[
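The two steps can be simulated without a cluster. This is a sketch under stated assumptions, not the CourboSpark code: each "host" is a plain Python list of (feature value, curve) rows, bins have a fixed width of 10, and the split score is the inter-class inertia of the two sides. All function names are hypothetical.

```python
# Simulated horizontal strategy: each "host" holds a slice of the rows and
# computes per-bin partial statistics; the driver merges them and picks a split.
from functools import reduce

BIN_WIDTH = 10  # bins [0:10[, [10:20[, ... on one numerical feature

def local_stats(rows):
    """Step 1 (per host): bin -> [count, element-wise sum of curves]."""
    stats = {}
    for value, curve in rows:
        b = int(value // BIN_WIDTH)
        if b not in stats:
            stats[b] = [0, [0.0] * len(curve)]
        stats[b][0] += 1
        stats[b][1] = [s + x for s, x in zip(stats[b][1], curve)]
    return stats

def merge(a, b):
    """Combine the partial statistics of two hosts."""
    out = {k: [n, list(s)] for k, (n, s) in a.items()}
    for k, (n, s) in b.items():
        if k in out:
            out[k][0] += n
            out[k][1] = [x + y for x, y in zip(out[k][1], s)]
        else:
            out[k] = [n, list(s)]
    return out

def best_split(stats):
    """Step 2 (driver): threshold maximizing the inter-class inertia."""
    bins = sorted(stats)
    best = None
    for i in range(1, len(bins)):
        sides = []
        for group in (bins[:i], bins[i:]):
            n = sum(stats[b][0] for b in group)
            total = [sum(col) for col in zip(*(stats[b][1] for b in group))]
            sides.append((n, [x / n for x in total]))
        (nl, ml), (nr, mr) = sides
        # For two groups, inter inertia = nl*nr/(nl+nr) * ||ml - mr||^2.
        score = nl * nr / (nl + nr) * sum((x - y) ** 2 for x, y in zip(ml, mr))
        if best is None or score > best[1]:
            best = (bins[i] * BIN_WIDTH, score)
    return best

hosts = [
    [(3, [1.0, 1.0]), (12, [1.2, 0.8])],
    [(7, [0.8, 1.2]), (25, [5.0, 6.0])],
    [(14, [1.0, 1.0]), (28, [5.2, 5.8])],
]
merged = reduce(merge, (local_stats(rows) for rows in hosts))
threshold, score = best_split(merged)
print(threshold, score)
```

Only the per-bin counts and summed curves travel between hosts, so the data exchanged depends on the number of bins rather than on the number of rows.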
| 12
FROM MLLIB TO COURBOSPARK
To build the tree:
• Criteria: entropy, Gini, variance
• Data structure: LabeledPoint
| 13
FROM MLLIB TO COURBOSPARK
To build the tree:
• Criteria: entropy, Gini, variance, inertia (to compare time-series)
• Data structure: LabeledPoint, TimeSeries
• Finding the split point for nominal features
For data visualization of the tree:
• Quantiles on the nodes and leaves
• Loss of inertia
• Number of curves per node and per leaf
| 14
DEALING WITH NOMINAL FEATURES
Current implementation for regression:
→ order the categories by their mean on the target
Example order: A, C, B, D
Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
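A minimal sketch of this ordering trick for a scalar target. The function name and the toy data are illustrative, with values chosen so that the category order is A, C, B, D. Only n - 1 partitions are tested for n categories.

```python
# Ordering trick for nominal features in regression: sort categories by the
# mean of the target, then test only the n - 1 "prefix" partitions.
from collections import defaultdict

def ordered_partitions(rows):
    """rows: iterable of (category, target). Returns a list of (left, right)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in rows:
        sums[cat] += y
        counts[cat] += 1
    order = sorted(sums, key=lambda c: sums[c] / counts[c])
    return [(order[:i], order[i:]) for i in range(1, len(order))]

rows = [("A", 1.0), ("A", 1.2), ("B", 3.1), ("C", 2.9), ("C", 3.0), ("D", 7.5)]
for left, right in ordered_partitions(rows):
    print(left, "/", right)
```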
| 15
NOMINAL VALUES: TYPE OF CONTRACT
4 CATEGORIES {A, B, C, D}
A B
C D?
| 16
DEALING WITH NOMINAL FEATURES
Hard to order curves…
Solution 1:
Compare curves 2 by 2 → {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
Problem:
Combinatorial explosion in n, the number of distinct categories. Complexity is O(2^n)
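To make the growth concrete, the sketch below enumerates every split of a category set into two non-empty groups. There are 2^(n-1) - 1 of them for n categories. The helper name is hypothetical.

```python
# Exhaustive enumeration of binary partitions of a set of categories.
from itertools import combinations

def binary_partitions(categories):
    """Yield every split into two non-empty groups, each split once."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    # Pin the first category to the left side so mirrored splits are not repeated.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            if right:
                yield left, right

parts = list(binary_partitions("ABCD"))
print(len(parts))
for n in (4, 10, 20):
    print(n, 2 ** (n - 1) - 1)   # number of candidate splits for n categories
```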
| 17
DEALING WITH NOMINAL FEATURES
Solution 2:
Agglomerative hierarchical clustering (bottom-up approach).
Complexity is O(n^3); we don't expect n > 100
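A naive bottom-up grouping can be sketched as follows. The slides do not specify the linkage, so this assumes clusters are merged by the smallest squared distance between count-weighted mean curves, stopping when two groups remain. Names and data are illustrative.

```python
# Bottom-up (agglomerative) grouping of categories by their mean load curve.
# Assumption: merge the two clusters whose mean curves are closest, until
# only n_groups clusters remain.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(mean_curves, counts, n_groups=2):
    """mean_curves: {category: curve}; counts: {category: number of curves}."""
    clusters = [([cat], counts[cat], list(curve))
                for cat, curve in mean_curves.items()]
    while len(clusters) > n_groups:
        # Closest pair search: O(n^2) per merge, hence O(n^3) overall.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda p: sq_dist(clusters[p[0]][2], clusters[p[1]][2]),
        )
        (ca, na, ma), (cb, nb, mb) = clusters[i], clusters[j]
        merged_mean = [(na * x + nb * y) / (na + nb) for x, y in zip(ma, mb)]
        merged = (ca + cb, na + nb, merged_mean)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c[0]) for c in clusters]

mean_curves = {
    "A": [1.0, 4.0, 2.0],
    "B": [0.5, 0.6, 0.5],
    "C": [1.1, 4.2, 2.1],
    "D": [0.6, 0.7, 0.4],
}
counts = {"A": 120, "B": 80, "C": 60, "D": 40}
groups = agglomerate(mean_curves, counts)
print(groups)
```

The two remaining groups give one candidate binary partition of the categories, which avoids testing all 2^(n-1) - 1 splits.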
| 18
HOW TO
Algorithm parameters
Configure spark context
Load the data file
Learn the model
| 19
LOOKING FOR THE TEST CONFIGURATION
For a constant global capacity on 12 nodes:
• 120 cores + 120 GB RAM
#Executors   RAM per exec.   Cores per exec.   Performance on 100 GB of data
12           10 GB           10                22 minutes
24           5 GB            5                 17 minutes
60           2 GB            2                 12 minutes
120          1 GB            1                 15 minutes
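The table can be checked in a few lines: every row uses the same total capacity, and the fastest run is the one with 60 executors. The tuple layout is only an illustration of the slide's figures.

```python
# (executors, RAM per executor in GB, cores per executor, minutes on 100 GB)
runs = [(12, 10, 10, 22), (24, 5, 5, 17), (60, 2, 2, 12), (120, 1, 1, 15)]

# Every configuration spends the same global budget: 120 GB RAM and 120 cores.
totals = {(ex * ram, ex * cores) for ex, ram, cores, _ in runs}
print(totals)

# Configuration with the shortest run time.
best = min(runs, key=lambda r: r[3])
print(best)
```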
| 20
SCALABILITY TO #CONTAINERS
| 21
SCALABILITY TO #CONTAINERS
| 22
SCALABILITY TO #CONTAINERS
| 23
SCALABILITY TO #LINES
| 24
FRAMEWORK STABILITY
Tested on:
• 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
• Categorical and continuous variables
• Bin sizes from 100 to 1000
| 25
SCALABILITY TO #COLUMNS
| 26
SCALABILITY TO #CATEGORIES
| 27
| 28
REAL LIFE DATASET
[Figure: training time in minutes (y-axis, 0 to 400) versus data size in GB (x-axis, 0 to 1400)]
• 9 executors with 20 GB and 8 cores
• 10 million to 1,000 million load curves (10 numerical and 10 categorical features)
| 29
TUNING
• spark.default.parallelism
• spark.executor.memory
• spark.storage.memoryFraction
• spark.akka.frameSize
| 30
FEEDBACK
Developers' view
• Seamless transition from local to cluster mode
• Debug mode with an IDE
• Good performance requires expertise
| 31
HEY SCALA <3
| 32
FEEDBACK
Data scientists' view
• The API is not very data oriented
• …but now we have Spark SQL and DataFrames!
• IPython + PySpark
• Feature engineering vs. model engineering
| 33
FEEDBACK
Ops view
• Better than MapReduce
• Performance is predictable for tested code
• Runs on YARN
• Lots of releases; MLlib code is evolving quickly
| 34
FUTURE WORK
• Unbalanced trees
• Improve performance
• Other criteria for time-series comparison
• Missing values in explanatory features

 

Datastax Breakfast 14-04-15, Courbo Spark: an example of machine learning on time series

  • 1. COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK Christophe Salperwyck – EDF R&D Simon Maby – OCTO Technology - @simonmaby Xdata project: www.xdata.fr, grants from "Investissement d'Avenir" program, 'Big Data' call
  • 2. AGENDA
      1. PROBLEM DESCRIPTION
      2. IMPLEMENTATION: Courbotree (presentation of the algorithm); from MLlib to CourboSpark
      3. PERFORMANCES: configuration (cluster description, Spark config…)
      4. FEEDBACK ON SPARK/MLLIB
  • 4. 1 measure every 10 min; 35 million customers; time-series: 144 points x 365 days. Annual data volume: 1800 billion records, 120 TB of raw data. BIG DATA!
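The annual volume quoted on this slide can be checked with a few lines of arithmetic. This is only a sanity check: the bytes-per-record figure is derived here, not stated in the slides, and assumes decimal terabytes.

```python
# Back-of-the-envelope check of the annual data volume quoted on the slide.
POINTS_PER_DAY = 24 * 60 // 10   # one measure every 10 minutes
DAYS_PER_YEAR = 365
CUSTOMERS = 35_000_000

records_per_year = POINTS_PER_DAY * DAYS_PER_YEAR * CUSTOMERS
raw_bytes = 120 * 10**12         # 120 TB, assuming decimal terabytes

print(POINTS_PER_DAY)                            # 144
print(records_per_year)                          # 1839600000000, about 1800 billion
print(round(raw_bytes / records_per_year, 1))    # 65.2 bytes per record (derived)
```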
  • 5. LOAD CURVES CLASSIFICATION
      Contract type | Region | … | Equipment type | Load Curve
      9KVA          | 75     | … | Elec           | (plot)
      6KVA          | 22     | … | Gas            | (plot)
      …             | …      | … | …              | …
      12KVA         | 34     | … | Elec           | (plot)
  • 6. WHY A DECISION TREE? Easy to understand; ability to explore the model; ability to choose the expressivity of the model.
  • 7. SPLIT CRITERIA: INERTIA. Goal: find the most different curves depending on an explanatory feature. How to split? We can either minimize the dispersion of the curves (intra inertia) or maximize the differences between average curves (inter inertia).
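  • 8. MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type). [Figure: mean load curves for Electrical and Gas; x-axis: Hour, y-axis: P in W; the retained split is ArgMax(d), with d the distance between the mean curves]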
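The two criteria of slides 7 and 8 can be illustrated with a minimal pure-Python sketch. It assumes curves are equal-length lists and uses the squared Euclidean distance; the function names and toy data are hypothetical and are not taken from CourboSpark.

```python
def mean_curve(curves):
    """Point-wise average of a list of equal-length curves."""
    n = len(curves)
    return [sum(points) / n for points in zip(*curves)]

def sq_dist(a, b):
    """Squared Euclidean distance between two curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertia(curves):
    """Average squared distance of the curves to their mean curve."""
    center = mean_curve(curves)
    return sum(sq_dist(c, center) for c in curves) / len(curves)

def split_inertia(left, right):
    """Return (intra, inter) inertia of a binary split."""
    n = len(left) + len(right)
    overall = mean_curve(left + right)
    intra = (len(left) * inertia(left) + len(right) * inertia(right)) / n
    inter = (len(left) * sq_dist(mean_curve(left), overall)
             + len(right) * sq_dist(mean_curve(right), overall)) / n
    return intra, inter

# Toy curves with 3 points each (hypothetical values).
electrical = [[1, 5, 1], [3, 7, 3]]
gas = [[1, 1, 1], [1, 3, 1]]

intra, inter = split_inertia(electrical, gas)
print(intra, inter)                # 2.0 4.5
print(inertia(electrical + gas))   # 6.5, i.e. intra + inter
```

Since the total inertia of a node is fixed, minimizing the intra inertia and maximizing the inter inertia select the same split.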
  • 9. EXISTING DISTRIBUTED DECISION TREE
      - Scalable Distributed Decision Trees in Spark MLLib. Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
      - A MapReduce Implementation of C4.5 Decision Tree Algorithm. Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
      - PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
      - Distributed Decision Tree Learning for Mining Big Data Streams. Arinto Murdopo. Master thesis, Yahoo! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
  • 10. MLLIB DECISION TREE PARALLELIZATION
  • 11. HORIZONTAL STRATEGY. Step 1: compute average curves per bin ([0:10[, [10:20[) on each host (Host 1, Host 2, Host 3). Step 2: collect on one host (Host 1) and find the best split.
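The two steps of the horizontal strategy can be sketched in plain Python by simulating three hosts with lists. This only illustrates the aggregation pattern; it is not the MLlib or CourboSpark code, and the bin width, data and function names are assumptions.

```python
from functools import reduce

def local_stats(rows, bin_width=10):
    """Step 1 on one host: per-bin count and point-wise sum of the curves."""
    stats = {}
    for feature_value, curve in rows:
        b = int(feature_value // bin_width)
        count, total = stats.get(b, (0, [0.0] * len(curve)))
        stats[b] = (count + 1, [t + p for t, p in zip(total, curve)])
    return stats

def merge(a, b):
    """Step 2: combine the partial statistics coming from two hosts."""
    merged = dict(a)
    for key, (count, total) in b.items():
        if key in merged:
            c0, t0 = merged[key]
            merged[key] = (c0 + count, [x + y for x, y in zip(t0, total)])
        else:
            merged[key] = (count, total)
    return merged

# Each host holds a horizontal slice of rows: (feature value, load curve).
hosts = [
    [(3, [1.0, 2.0]), (12, [5.0, 6.0])],
    [(7, [3.0, 4.0])],
    [(15, [7.0, 8.0]), (18, [9.0, 10.0])],
]

combined = reduce(merge, (local_stats(rows) for rows in hosts))
average_curves = {b: [t / n for t in total] for b, (n, total) in combined.items()}
print(average_curves[0])   # [2.0, 3.0], bin [0:10[
print(average_curves[1])   # [7.0, 8.0], bin [10:20[
```

Only counts and sums travel between hosts, so the amount of data collected depends on the number of bins and the curve length, not on the number of rows.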
  • 12. FROM MLLIB TO COURBOSPARK. To build the tree: criteria: entropy, Gini, variance; data structure: LabeledPoint.
  • 13. FROM MLLIB TO COURBOSPARK. To build the tree: criteria: entropy, Gini, variance, inertia (to compare time-series); data structure: LabeledPoint, TimeSeries; finding the split point for nominal features. For data visualization of the tree: quantiles on the nodes and leaves; loss of inertia; number of curves per node and leaf.
  • 14. DEALING WITH NOMINAL FEATURES. Current implementation for regression: order the categories by their mean on the target (here A, C, B, D). Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}.
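The ordering trick for regression can be sketched as follows: sort the categories by their mean target value, then test only the n - 1 cuts along that order. The mean values below are hypothetical, chosen so that the order matches the slide.

```python
def ordered_partitions(mean_by_category):
    """Candidate binary partitions after sorting categories by mean target."""
    order = sorted(mean_by_category, key=mean_by_category.get)
    return [(order[:i], order[i:]) for i in range(1, len(order))]

# Hypothetical mean target per category, giving the order A, C, B, D.
means = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 4.0}

for left, right in ordered_partitions(means):
    print(left, right)
# ['A'] ['C', 'B', 'D']
# ['A', 'C'] ['B', 'D']
# ['A', 'C', 'B'] ['D']
```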
  • 15. NOMINAL VALUES: TYPE OF CONTRACT. 4 categories {A, B, C, D}: how to order A, B, C, D?
  • 16. DEALING WITH NOMINAL FEATURES. Hard to order curves… Solution 1: compare curves 2 by 2: {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}… Problem: the number of partitions is combinatorial in n, the number of different categories. Complexity is O(2^n).
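To see why Solution 1 does not scale, the binary partitions can be enumerated explicitly. The sketch below is a generic enumeration, not code from the talk: it keeps the first category on the left side so that each partition is produced once, which gives 2^(n-1) - 1 candidates.

```python
from itertools import combinations

def binary_partitions(categories):
    """All ways to split the categories into two non-empty groups."""
    first, rest = categories[0], categories[1:]
    result = []
    # The left group takes `size` extra items, never all of them,
    # so the right group is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            result.append((left, right))
    return result

print(len(binary_partitions(["A", "B", "C", "D"])))      # 7
print(len(binary_partitions(list(range(10)))))           # 511
print(len(binary_partitions(list(range(20)))))           # 524287
```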
  • 17. DEALING WITH NOMINAL FEATURES. Solution 2: Agglomerative Hierarchical Clustering, a bottom-up approach. Complexity is O(n^3); we don't expect n > 100.
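A naive bottom-up clustering of the categories could look like the sketch below. The slides do not specify the linkage or how the split is read off the hierarchy, so two assumptions are made here: clusters are compared by the squared distance between their weighted mean curves, and merging stops when two groups remain, which gives the binary split. Names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(mean_curves, counts):
    """Merge the closest categories until two groups remain."""
    # Each cluster is (member categories, number of curves, mean curve).
    clusters = [([c], counts[c], list(mean_curves[c])) for c in mean_curves]
    while len(clusters) > 2:
        # Closest pair search: O(n^2) per round, about n rounds -> O(n^3).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: sq_dist(clusters[p[0]][2], clusters[p[1]][2]),
        )
        (m1, n1, c1), (m2, n2, c2) = clusters[i], clusters[j]
        merged_curve = [(n1 * x + n2 * y) / (n1 + n2) for x, y in zip(c1, c2)]
        merged = (m1 + m2, n1 + n2, merged_curve)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _, _ in clusters]

# Hypothetical mean curves (2 points each) and curve counts per contract type.
mean_curves = {"A": [1.0, 1.0], "B": [1.2, 1.0], "C": [5.0, 5.0], "D": [5.5, 5.0]}
counts = {"A": 1, "B": 1, "C": 1, "D": 1}

print(sorted(agglomerate(mean_curves, counts)))   # [['A', 'B'], ['C', 'D']]
```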
  • 18. HOW TO: algorithm parameters; configure the Spark context; load the data file; learn the model.
  • 19. LOOKING FOR THE TEST CONFIGURATION
      For a constant global capacity on 12 nodes: 120 cores + 120 GB RAM
      #Executors | RAM per exec. | Cores per exec. | Performance on 100 GB data
      12         | 10 GB         | 10              | 22 minutes
      24         | 5 GB          | 5               | 17 minutes
      60         | 2 GB          | 2               | 12 minutes
      120        | 1 GB          | 1               | 15 minutes
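The figures of this table can be checked quickly: every configuration uses the same total capacity, and the fastest run is neither the coarsest nor the finest split. The tuple layout below is just a convenient encoding of the slide's table.

```python
# (executors, RAM per executor in GB, cores per executor, minutes on 100 GB)
configs = [(12, 10, 10, 22), (24, 5, 5, 17), (60, 2, 2, 12), (120, 1, 1, 15)]

for executors, ram_gb, cores, minutes in configs:
    # Constant global capacity: 120 GB of RAM and 120 cores in every row.
    assert executors * ram_gb == 120
    assert executors * cores == 120

best = min(configs, key=lambda c: c[3])
print(best)   # (60, 2, 2, 12)
```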
  • 20. SCALABILITY TO #CONTAINERS
  • 21. SCALABILITY TO #CONTAINERS
  • 22. SCALABILITY TO #CONTAINERS
  • 24. FRAMEWORK STABILITY. Tested on: 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB; categorical and continuous variables; bin sizes from 100 to 1000.
  • 26. SCALABILITY TO #CATEGORIES
  • 28. REAL LIFE DATASET. 9 executors with 20 GB and 8 cores; 10 to 1000 million load curves (10 numerical and 10 categorical features). [Chart: time in minutes versus data in GB]
  • 29. TUNING: spark.default.parallelism; spark.executor.memory; spark.storage.memoryFraction; spark.akka.frameSize
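These four properties can be passed to `spark-submit` with `--conf key=value`. The sketch below only builds such a command line; the values and the script name `train_tree.py` are placeholders, not the settings used in the talk, and `spark.storage.memoryFraction` and `spark.akka.frameSize` belong to the Spark 1.x releases of that period.

```python
# Placeholder values: the slides list the properties, not their settings.
settings = {
    "spark.default.parallelism": "240",
    "spark.executor.memory": "2g",
    "spark.storage.memoryFraction": "0.5",
    "spark.akka.frameSize": "64",
}

def to_submit_args(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args

# train_tree.py is a hypothetical application script.
command = ["spark-submit"] + to_submit_args(settings) + ["train_tree.py"]
print(" ".join(command))
```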
  • 30. FEEDBACK. Developers' view: flawless transition from local to cluster mode; debug mode with an IDE; good performance requires knowledge of Spark.
  • 32. FEEDBACK. Data scientists' view: the API is not very data oriented… but now we have SparkSQL and DataFrames!; IPython + PySpark; feature engineering vs model engineering.
  • 33. FEEDBACK. Ops view: better than MapReduce; performance is predictable for tested code; YARNed; lots of releases, MLlib code is evolving quickly.
  • 34. FUTURE WORK: unbalanced trees; improve performance; other criteria for time-series comparison; missing values in explanatory features.