Title:
"Machine learning and Internet of Things, the future of medical prevention"
Abstract:
In this talk, Pierre Gutierrez, a data scientist at Dataiku, will discuss Dataiku's experience using machine learning on IoT data. We will talk about the challenges of processing and cleaning IoT data, and how to successfully train a model that can be deployed in production. We will illustrate the talk with two examples from our previous work: creating an algorithm for early epileptic seizure detection based on wearable tech, and detecting people's activity through sensor data.
3. Dataiku
• Founded in 2013
• 60+ employees
• Paris, New York, London, San Francisco
Data science software vendor, maker of Dataiku DSS
DESIGN
• PREPARE: Load and prepare your data
• MODEL: Build your models
• ANALYSE: Visualize and share your work
PRODUCTION
• AUTOMATE: Re-execute your workflow at ease
• MONITOR: Follow your production environment
• SCORE: Get predictions in real time
4. A data science workflow
Six steps to a predictive model
[Diagram. Stages: Business/Problem Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment. Datasets: Dataset 1, Dataset 2, …, Dataset n. Results: scored dataset, model as an API. Repeated over Iteration 1, Iteration 2, …, Iteration n.]
Creating a predictive model is a highly iterative process.
Data Science Studio enables its users to create and manage these projects from end to end.
This process is not industry specific, and can be applied to many use cases.
Adapted from the CRISP-DM methodology
5. Epilepsy
Stats and figures
• 1-3% of the population
• 15.5 billion euros / year spent treating seizures
• 6 types of epilepsy
• Dozens of existing treatments
• Days to weeks of hospital time required to diagnose
9. Goals
Improve Epilepsy Diagnosis
1. Allow at-home EEG recording via wearable device
2. Detect seizures automatically
3. Detect spikes automatically
4. Shorten time-to-diagnosis for patients with epilepsy
10. Ageing
Stats and figures
• x3 over the last 60 years
• 12 million fall every year in the U.S.
• 700 million people older than 60 in 2006
• 28.5% of this population live alone in the EU
• Third leading cause of death: strokes
• 3/4 of all strokes happen to people over 65
12. Goals
Improve Falling Detection
1. Predict falls and detect strokes so that help may be summoned
2. Analyse eating behaviour, including whether people are taking prescribed medication
3. Detect periods of depression or anxiety and intervene using a computer-based therapy
15. The Data
1024 Hz x 24 channels = 353 MB / (hour x patient)
20 patients x 24 hours = 170 GB
Nightly transfers of data from device to cloud (via wifi)
We want to scale to hundreds of patients with days of data
Epilepsy
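The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the slide does not give the sample width, so 4 bytes per sample (for example 32-bit floats) is an assumption, chosen because it reproduces the numbers shown. The function name is illustrative.

```python
def bytes_per_patient_hour(sampling_hz, channels, bytes_per_sample=4):
    """Raw data volume for one patient over one hour.

    bytes_per_sample=4 is an assumption; the slide does not state it.
    """
    return sampling_hz * channels * bytes_per_sample * 3600


per_hour = bytes_per_patient_hour(1024, 24)  # one patient, one hour
total = per_hour * 20 * 24                   # 20 patients, 24 hours each

print(f"{per_hour / 1e6:.1f} MB per hour and patient")
print(f"{total / 1e9:.1f} GB for 20 patients x 24 hours")
```

Under the same assumption, the volume grows linearly with the number of patients and recording days, which is what makes the "hundreds of patients" goal a storage problem.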
16. The Data
• Accelerometer: sampled at 20 Hz
• RGB-D: bounding box information
• Environmental: the values of passive infrared (PIR) sensors
Safe Aging
17. The Data
Needs
• Interpolation
• Missing data, synchronization failures
• Smart sampling
• Zoom at different frequency levels
• Different sensors -> different frequencies -> how to merge?
• Aggregation
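One possible answer to the "how to merge?" question is an as-of join: for each sample of the fast sensor, attach the most recent reading of the slow one. The sketch below uses pandas `merge_asof`. The 20 Hz accelerometer rate comes from the slides; the 1 Hz PIR rate, the column names and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical 20 Hz accelerometer stream (one sample every 50 ms).
accel = pd.DataFrame({
    "t": pd.date_range("2016-01-01", periods=60, freq=pd.Timedelta(milliseconds=50)),
    "accel_x": range(60),
})

# Hypothetical slower sensor: one PIR reading per second (assumed rate).
pir = pd.DataFrame({
    "t": pd.date_range("2016-01-01", periods=3, freq=pd.Timedelta(seconds=1)),
    "pir": [0, 1, 0],
})

# For each accelerometer sample, take the latest PIR reading that is
# at most 1 second old. Both frames must be sorted by the join key.
merged = pd.merge_asof(
    accel, pir, on="t", direction="backward", tolerance=pd.Timedelta(seconds=1)
)
print(merged.iloc[[0, 19, 20, 40]])
```

The `tolerance` argument keeps stale readings from being carried forward indefinitely when a sensor drops out.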
18. Time Series as Relational Data
Time Stamp   Sensor1   Sensor2   Sensor3
10001        40        -         30
10002        -         50        34
10003        -         55        60
10004        43        20        -
10005        42        -         40
Aggregation
Resampling
Interpolation
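As a rough illustration of interpolation and aggregation on the relational layout, the table above can be loaded into a pandas DataFrame. This is a sketch, not the tooling used in the talk: the bucket width of 2 timestamps is arbitrary, and the unit of the timestamps is unknown.

```python
import numpy as np
import pandas as pd

# The table from the slide; NaN stands for a missing reading ("-").
df = pd.DataFrame(
    {
        "sensor1": [40, np.nan, np.nan, 43, 42],
        "sensor2": [np.nan, 50, 55, 20, np.nan],
        "sensor3": [30, 34, 60, np.nan, 40],
    },
    index=pd.Index([10001, 10002, 10003, 10004, 10005], name="timestamp"),
)

# Interpolation: fill gaps linearly, but only between two known values
# (limit_area="inside" leaves leading and trailing gaps untouched).
filled = df.interpolate(method="linear", limit_area="inside")

# Aggregation / downsampling: average readings in buckets of 2 timestamps.
bucket = (filled.index.to_numpy() - 10001) // 2
aggregated = filled.groupby(bucket).mean()

print(filled)
print(aggregated)
```

Gaps at the edges of a series (sensor2 at 10001 and 10005) stay missing here, since there is no second neighbour to interpolate from.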
19. Time Series as JSON
{
  "sensor1": {
    "10001": 40,
    "10004": 43,
    "10005": 42
  },
  "sensor2": {
    "10002": 50,
    "10003": 55,
    "10004": 20
  }
  …
}
Aggregation
Resampling
Interpolation
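The JSON layout stores only the readings that exist, so resampling or interpolation first needs the sensors aligned on a common time axis. The sketch below does that with the standard library only, using the two sensors shown on the slide. Timestamps are quoted because JSON object keys must be strings; the row layout is an illustrative choice.

```python
import json

doc = json.loads("""
{
  "sensor1": {"10001": 40, "10004": 43, "10005": 42},
  "sensor2": {"10002": 50, "10003": 55, "10004": 20}
}
""")

# Every timestamp seen by at least one sensor, in increasing order.
timestamps = sorted({int(t) for series in doc.values() for t in series})

# One row per timestamp; None marks a sensor with no reading at that time.
rows = [
    {"timestamp": t, **{name: series.get(str(t)) for name, series in doc.items()}}
    for t in timestamps
]

for row in rows:
    print(row)
```

The resulting rows have the same shape as the relational layout, with explicit gaps where a sensor was silent.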
20. Time Series as Time Series
Time Series Database
Aggregation
Resampling
Interpolation
21. Signal Processing
Lots of libraries, lots of options
Rename
Generate
Rolling mean
Rolling max
Rolling min
Rolling median
Wavelet decomposition
STL decomposition
Peak detection
Low pass filter
High pass filter
Convolution
Correlation
Short-time FFT
Implemented with a common interface
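To make the "common interface" idea concrete, here is a minimal NumPy sketch in which every operation takes a 1-D signal plus keyword parameters and returns an array of the same length. The registry, the function names and the FFT-based brick-wall filters are assumptions for illustration, not the interface presented in the talk.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling(x, window, reducer):
    """Trailing rolling statistic; the first window-1 outputs are NaN."""
    out = np.full(x.shape, np.nan)
    if x.size >= window:
        out[window - 1:] = reducer(sliding_window_view(x, window), axis=-1)
    return out


def low_pass(x, cutoff_hz, sampling_hz):
    """Brick-wall low-pass: zero every FFT bin above cutoff_hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sampling_hz)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=x.size)


def high_pass(x, cutoff_hz, sampling_hz):
    """Complement of the low-pass above."""
    return x - low_pass(x, cutoff_hz, sampling_hz)


# Hypothetical registry: same call shape for every operation.
TRANSFORMS = {
    "rolling_mean": lambda x, window: rolling(x, window, np.mean),
    "rolling_max": lambda x, window: rolling(x, window, np.max),
    "rolling_min": lambda x, window: rolling(x, window, np.min),
    "rolling_median": lambda x, window: rolling(x, window, np.median),
    "low_pass": low_pass,
    "high_pass": high_pass,
}


def apply_transform(name, x, **params):
    return TRANSFORMS[name](np.asarray(x, dtype=float), **params)


# Example: a 1 Hz wave plus an 8 Hz wave, sampled at 20 Hz for 2 seconds.
t = np.arange(40) / 20.0
slow = np.sin(2 * np.pi * 1 * t)
fast = np.sin(2 * np.pi * 8 * t)
smoothed = apply_transform("low_pass", slow + fast, cutoff_hz=3, sampling_hz=20)
means = apply_transform("rolling_mean", [1, 2, 3, 4, 5], window=3)
```

A production pipeline would more likely use a designed filter (for example from a signal-processing library) than a brick-wall cut, which can ring on non-periodic signals.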
25. Machine learning
Features
• Descriptive features
  Epilepsy: patient information: EMR…
  Safe aging: EMR, age, height
• Time series features
  Epilepsy: current values, previous values, correlations, Fourier, wavelets, …
  Safe aging: current values, previous values, rolling averages, …
26. Machine learning
Features
• More data means less feature engineering
  Safe aging (lots of values): XGBoost on current and previous values: let the model find the interactions
  Epilepsy (millions of lines): RNN, LSTM. Network architecture = feature engineering
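The "current and previous values" and "rolling averages" features can be sketched with pandas `shift` and `rolling`. The column names, lags and window length below are illustrative choices, and the sample values are made up. Training a gradient-boosted model on the result is left out.

```python
import pandas as pd


def add_time_features(df, column, lags=(1, 2), window=3):
    """Add previous values (lags) and a trailing rolling mean of `column`."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    out[f"{column}_rollmean{window}"] = out[column].rolling(window).mean()
    return out


# Hypothetical sensor readings, one row per time step.
signal = pd.DataFrame({"sensor1": [40, 41, 42, 43, 42]})
features = add_time_features(signal, "sensor1")
print(features)
```

With several patients or recordings in one table, the lags and rolling windows would need to be computed per group so that values do not leak across recording boundaries.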
28. Split 1: Awesome performance
4 patients, 4 readings from each patient
Training Testing
AUC = 0.94
29. Split 2: OK performance
4 patients, 4 readings from each patient
Training Testing
AUC = 0.70
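The text of the two slides does not say how the splits differ. A common cause of such a gap, and the assumption behind this sketch, is that Split 1 mixes readings from the same patient across training and testing, while Split 2 holds out whole patients. The identifiers below are hypothetical.

```python
import random

# Layout from the slides: 4 patients, 4 readings from each patient.
readings = [(f"patient{p}", f"reading{r}") for p in range(1, 5) for r in range(1, 5)]


def split_by_reading(readings, test_fraction=0.25, seed=0):
    """Random split over readings: a patient can end up on both sides."""
    shuffled = readings[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_patient(readings, test_patients):
    """Hold out whole patients: no patient appears on both sides."""
    train = [r for r in readings if r[0] not in test_patients]
    test = [r for r in readings if r[0] in test_patients]
    return train, test


mixed_train, mixed_test = split_by_reading(readings)
held_train, held_test = split_by_patient(readings, {"patient4"})

# Patients shared between train and test under each scheme.
print({p for p, _ in mixed_train} & {p for p, _ in mixed_test})
print({p for p, _ in held_train} & {p for p, _ in held_test})
```

Only the patient-wise split measures how the model behaves on someone it has never seen, which is the situation a deployed device faces.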
30. Worries
About our spike detection model
Poor generalization to new patients
What about new devices?
What about different doctors creating annotations?
Solution: more patients, more doctors, more devices
31. Worries
About our position detection model
Average generalization to new patients
What about new devices?
What about different homes / rooms?
Data solution: more patients, more houses, more devices
Practical solution: warm start with house + person. Expensive
33. Deploy
• Model Deployment
  Epilepsy diagnosis: batch scoring on all records after X days
  Epilepsy spike detection: batch scoring (used for diagnosis)
  Epilepsy seizure detection: real-time scoring
  Safe aging: real-time scoring every second
Theory
• Maintain your feature flow!
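One way to read "maintain your feature flow" is that batch scoring and real-time scoring should compute features with the same code, so that the model sees identical inputs in both modes. The sketch below shares one feature function between a batch path and a streaming buffer. The window length, feature names and class name are hypothetical.

```python
from collections import deque
from statistics import mean

WINDOW = 3  # hypothetical window length, in samples


def features(window_values):
    """Single definition of the features, used by both scoring modes."""
    values = list(window_values)
    return {
        "current": values[-1],
        "previous": values[-2],
        "rolling_mean": mean(values),
    }


def batch_features(signal):
    """Batch mode: slide the window over a complete recording."""
    return [
        features(signal[i - WINDOW + 1 : i + 1])
        for i in range(WINDOW - 1, len(signal))
    ]


class StreamingFeatures:
    """Real-time mode: keep only the last WINDOW samples in memory."""

    def __init__(self):
        self.buffer = deque(maxlen=WINDOW)

    def push(self, value):
        self.buffer.append(value)
        if len(self.buffer) < WINDOW:
            return None  # not enough history yet
        return features(self.buffer)


signal = [40, 41, 42, 43, 42]
stream = StreamingFeatures()
streamed = [f for f in (stream.push(v) for v in signal) if f is not None]
print(streamed == batch_features(signal))
```

Comparing the two paths on recorded data, as in the last line, is a cheap regression test to run whenever the feature code changes.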
34. Deploy
• Don't underestimate real-life conditions
• Anomalies
• Headset in the wrong position
• Bracelet on the wrong hand
• Hardware / sensor defects
Practice
• Challenge: go beyond clinical experiments
36. Summary
• IoT devices can improve early detection (epilepsy, falls, …)
• IoT devices produce lots of data: use databases made for IoT
• The standard workflow (acquire, visualize, prepare, model) can be replicated for IoT devices using open source software
• Differences between patients remain a challenge for prediction algorithms
IoT devices for medical applications