Big Data Analytics for connected home: a few use cases, some important messages and a little example. Presentation given at CEA Cadarache - Cité des Nouvelles Energies to the strategic committee of ARCSIS (http://www.arcsis.org/missions.html)
1. May 22, 2015
Data Science Consulting
Héloïse Nonne
Senior Data Scientist - Manager
Big Data Analytics for connected home
2. Data analytics for disconnected homes
$$y_t = \mu + \epsilon_t + \phi_1 y_{t-1} + \dots + \phi_n y_{t-n} - \theta_1 \epsilon_{t-1} - \dots - \theta_n \epsilon_{t-n}$$
ARIMA models (AutoRegressive Integrated Moving Average)
$y_t$ = electric load at time $t$
$\epsilon_t$ = noise at time $t$
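The recursion on this slide can be tried out on synthetic data. The sketch below is only an illustration, not the model used in the talk: it simulates the autoregressive and moving-average terms exactly as written in the equation (the differencing step that gives ARIMA its "I" is left out), and all coefficients and function names (`simulate_arma`, `one_step_forecast`) are hypothetical.

```python
import random

def simulate_arma(mu, phi, theta, n_steps, noise_std=1.0, seed=0):
    """Simulate y_t = mu + eps_t + sum_i phi_i*y_{t-i} - sum_j theta_j*eps_{t-j}."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, noise_std) for _ in range(n_steps)]
    y = []
    for t in range(n_steps):
        # Terms that would reach before the start of the series are skipped.
        ar = sum(p * y[t - 1 - i] for i, p in enumerate(phi) if t - 1 - i >= 0)
        ma = sum(q * eps[t - 1 - j] for j, q in enumerate(theta) if t - 1 - j >= 0)
        y.append(mu + eps[t] + ar - ma)
    return y, eps

def one_step_forecast(mu, phi, theta, y, eps):
    """Expected next load given past loads and past noise (future noise has mean 0)."""
    ar = sum(p * y[-1 - i] for i, p in enumerate(phi))
    ma = sum(q * eps[-1 - j] for j, q in enumerate(theta))
    return mu + ar - ma

# Hypothetical coefficients, chosen only so that the series stays stable.
load, noise = simulate_arma(mu=1.0, phi=[0.6, 0.2], theta=[0.3], n_steps=200)
next_load = one_step_forecast(1.0, [0.6, 0.2], [0.3], load, noise)
print(f"forecast for the next step: {next_load:.3f}")
```

In practice the coefficients are not chosen by hand but estimated from measured loads, which is where the data limitations listed on this slide come into play.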
• Very low time resolution for local (household) measurements (less than one reading per quarter)
• Only aggregated data (sum of individual
loads) for higher frequency
measurements (region, neighborhood)
• Data storage issues
• Computation power
• Limited knowledge at local
level
• Limited predictive power
• Complex sophisticated models
exist but are difficult to tune
3. Reducing electricity costs: a complete data ecosystem
Electricity demand ????
Time: historical data, actual measurement (real-time), forecast
Regional / national scale
• Weather: sun, wind, cloud cover, humidity, temperature
• Energy production
• Energy price
• Anthropological data: energy consumption patterns
Local / neighborhood scale
• Appliances and use
• Heating
• Electricity storage
• Elevators
• Doors / lights
• Network activity -> current occupation
• Renewable energy
• Shutter orientation
• Building structure (thermal mass)
• Anthropological data: comfort temperature, children at school, activity of occupants, weekday / holiday, hour of day
4. Multiple sources of data for multiple models
• Volume
– vast amounts of data
– too large to store and analyse using
traditional technology
• Velocity
– speed at which new data is generated
– speed at which data change
• Variety
– types of data (number, text, images, video)
– types of sources (real-time, static)
• Veracity
– accuracy of data (frequency, errors)
– quality of data (sampling errors, typos)
5. Technology choices depend on the use case
Transaction-oriented
• Write/Read
• Logs
• Transactions
Hardware / software: HBase, Cassandra
Need: Bank – Stock market, Web logs, In/out
Streaming-oriented
• Compute on the fly
• Reactivity
• Real-time decisions
Hardware / software: Storm, Kafka, Spark
Need: Energy load management, Industrial processes, Aeronautics
Computationally intensive
• CPU/GPU bound
• Complex problem to solve
Hardware / software: HPC
Need: Image recognition, Research on DNA, …
Storage-oriented
• Loads of data
• Analysis
• Algorithms
Hardware / software: Hadoop, SQL interactive, Tez, Mahout, Spark
Need: Customer web journey, Bank – Insurance, Customer management, Records, archiving
7. Many use cases
Society
• Detect energy poverty (underheating)
• Detect people in distress (illness, elderly people, heat waves, …)
• Improve safety (fire detection, security, …)
Research / knowledge
• Building optimization (thermal mass, insulation, configuration, window orientation)
• Consumption patterns
• Social behaviors
Sustainability
• Optimize use and storage of energy (light management, appliance use, demand reduction, …)
• Improve comfort in the neighborhood
• Reduce waste (energy, water, appliances)
Business
• Scoring and customer segmentation
• Predict energy demand
• Predictive maintenance (elevators, HVAC, photovoltaics, …)
• Cost reduction
But remain pragmatic and think about the whole picture
-> predictive maintenance on light bulbs ??!
8. Predictive maintenance
Data
• Shaft speed
• Vibrations (X, Y, Z)
• Sound measurements
• Rail vibrations
• Motor temperature
• Oil buffer
• …
Wear, failure
• Bearing fault
• Door: Shoe deformation
• Unbalance
• Misalignment
• Resonance
• …
Elevator maintenance
predict failure before breakage
Cost reduction and improvement of reliability through predictive maintenance
9. A predictive maintenance management system
Requirements
• Continuous adaptation of the diagnosis
• Build, increase and maintain knowledge
• Handle large quantities of data
• Handle uncertainty in the diagnosis
• Assess fault severity
Challenges
• Symptoms are a mix of different causes
• Information is unclear
• Limited frequency resolution
• Missing data
• Noise
Data center
Remote management system
Richer knowledge from multiple sources
10. Bayesian networks
• Compact representation of entities states or
events as random variables
• Contains knowledge about how states /events are
related
BF Bearing fault
DF Door deformation
WU Weight unbalance
RN Resonance
MA Misalignment
AYX Vibration freq peak on axis A at YX (e.g. X1X, Z2X)
TP Temperature > x °C
SP Shaft speed freq peaks
SdB Sound > x dB
[Figure: Bayesian network over the nodes MA, RN, BF, DF, WU, SP, SdB, TP, X1X, X2X, Y1X, Y2X, Z1X, Z2X]
Bayesian network
• Qualitative = dependence relations
• Quantitative = the strengths of the relations
Advantages
• Mix a priori knowledge with experimental (real-time) data
• Explanatory (human understanding of phenomena vs black-box models)
• Uncertainty management (assessment of probability of failure)
• Possibility to learn
– Parameters
– Structures (events, entities, causes and effects)
Decision rules for action
Absolute need of prior knowledge from professionals
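To make the "uncertainty management" point concrete, here is a minimal sketch of how a fault probability could be updated from observed symptoms with Bayes' rule. It assumes the simplest possible structure, one fault node (WU) whose symptom nodes are independent given the fault; the actual network of the slide is richer, and every number and the function name `posterior_fault` are hypothetical.

```python
def posterior_fault(prior, likelihoods, observations):
    """P(fault | observed symptoms) for one fault node with independent symptom children.

    prior        -- P(fault = True)
    likelihoods  -- {symptom: (P(symptom | fault), P(symptom | no fault))}
    observations -- {symptom: True/False} for the symptoms actually measured
    """
    p_true, p_false = prior, 1.0 - prior
    for symptom, seen in observations.items():
        p_if_fault, p_if_ok = likelihoods[symptom]
        p_true *= p_if_fault if seen else 1.0 - p_if_fault
        p_false *= p_if_ok if seen else 1.0 - p_if_ok
    # Normalize so that the two hypotheses sum to 1.
    return p_true / (p_true + p_false)

# Hypothetical numbers, chosen only for illustration.
likelihoods = {"X1X": (0.9, 0.2), "TP": (0.3, 0.1)}
p = posterior_fault(prior=0.1, likelihoods=likelihoods, observations={"X1X": True})
print(f"P(WU | X1X peak observed) = {p:.3f}")
```

Symptoms that were not measured are simply left out of `observations`, which is one way such a model copes with missing data.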
11. Bayesian networks
[Figure: Bayesian network of slide 10]
A priori conditional probability table (Experience = 10)
WU              Probability
True (failure)  0.60
False           0.40
Update with new experience
$$P_{n+1} = \frac{(P_n \times \text{nb\_experiences}) + 1}{\text{nb\_experiences} + 1}$$
Updated table (Experience = 11)
WU              Probability
True (failure)  0.636
False           0.364
One can unlearn (forget past, outdated experiences) by using fading tables:
add a fading factor in front of the oldest experiences
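The update rule and the fading idea can be sketched in a few lines. The update itself follows the formula of this slide; the way fading is applied here (multiplying the experience count by a factor between 0 and 1 before each update) is an assumption, since the slide does not give the exact scheme, and `update_cpt` is a hypothetical name.

```python
def update_cpt(p_true, n_experiences, observed_true, fading=1.0):
    """Update P(True) after one new case.

    fading = 1.0 keeps all past experience (the formula of the slide);
    fading < 1.0 discounts past experience so that old cases weigh less.
    """
    weight = fading * n_experiences          # effective number of past cases
    hit = 1.0 if observed_true else 0.0      # 1 if the new case is a failure
    new_p = (p_true * weight + hit) / (weight + 1.0)
    return new_p, weight + 1.0

# Numbers of the slide: P(True) = 0.60 after 10 experiences, then one more failure.
p, n = update_cpt(0.60, 10, observed_true=True)
print(round(p, 3), round(1 - p, 3), n)   # 0.636 0.364 11.0

# Same case with a hypothetical fading factor of 0.5: the new case weighs more.
p_fade, n_fade = update_cpt(0.60, 10, observed_true=True, fading=0.5)
print(round(p_fade, 3), n_fade)
```

With fading the experience count stops growing without bound, so the table keeps reacting to recent cases instead of freezing on old ones.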
12. The big (data) picture
• Many sources of data: weather, energy production, economic, social and behavioral data, appliance characteristics, current building occupation, activity, etc.
• Different scales: worldwide, regional, local, individual
• Different times: historical data, year, month, day, hour, real-time
• The system is not going to be perfect at once -> design it for constant improvement
• A single model is useless: each model has its use, and models feed each other with their knowledge and predictions
• Choose the right model and the right technology according to use case, time cost, energy cost, pragmatism, realism
• Build models with the professionals who know the problem -> build on existing knowledge
An efficient system implies close collaboration between businesses, researchers, manufacturers, maintainers, owners, users, developers, data scientists, data managers and optimization specialists
13. Quantmetry – Data science specialist
The data pyramid, from base to top:
• Collect: more and more data available
• Store: store everything!
• Analyze: to better understand strong and weak signals
• Predict: anticipate what may happen based on past trends
• Act: automate decisions and actions
Quantmetry supports its clients across every layer of the data pyramid and thereby contributes to their digital transformation through quantitative methods, for concrete results on their business performance.
• a "pure player" consulting firm in Big Data and Data science, whose commercial development started in 2013
• advanced statistical methods, machine learning and Big Data technologies
• 2014: 1.5 million euros in revenue, with strong growth ambitions in France and abroad
• About twenty data scientists / consultants
14. Quantmetry's activities
Business optimization through data
Structuring a Data Lab
Consulting / Support / Delivery
• Detecting and prioritizing data-driven opportunities
• Designing IT architecture blueprints
• Lessons learned and best practices
• Organization and governance blueprint
• Choosing a technology architecture
Change management
Project management
• Scoping, industrialization project
• Methodology (statistical models and algorithms)
• Big Data technologies
• Skills development
• Recruitment
• Governance
Pilot projects
Industrialization
• Data science proof of concept
• Technology pilots
• Industrialization of pilots (API, …)
• Building a Big Data architecture and setting up data flows
15. Technology watch and experimentation
• Research themes:
– Online learning
– Deep learning and neural networks
– Industrialization
– Semantic analysis
– Energy (time series analysis)
– Smart cities
– Improving user experience
• Player in the Big Data ecosystem: participation in seminars, international conferences, hackathons, Kaggle competitions, partnerships with software vendors… Collaborations with research laboratories and schools.
• Creation and development of specific products built on Big Data technologies
• Research and development in Data science
16. Selected Data science references
• Improving lift for converting insured customers into banking customers
[Figure labels: baseline (logistic regression), gradient boosting, unstructured data, feature engineering; Lift = 2, Lift = 6]
• Churn detection for a telecom operator
[Figure: bar chart, scale 0 to 40, over the variables cancellation page URL, age, group, number of pages viewed…, session duration]
• Setting up a Data Lab for an insurer
• Behavior analysis for a mutual insurance company
• Optimizing a pricing tool for a B2B distribution company
• Predictive models of energy consumption