The document discusses methodologies for data science and the Internet of Things (IoT). It begins by noting that there is currently no single agreed upon methodology for solving data science problems for IoT (IoT analytics). It then poses some initial questions on whether a distinct IoT data science methodology is needed, and if IoT problems warrant a specific approach. While IoT data science problems are similar to general data science problems, the document notes there are some unique considerations for IoT, such as the use of hardware, high data volumes, and streaming data.
4. Copyright : Futuretext Ltd. London3
Ajit Jaokar
Data Science for IoT @Oxford Uni + UPM(Smart cities) + Online
Next book part of Stanford Uni course
In 2015, Ajit was included in 16 Top Data Science bloggers on Data Science
Central, Top 100 blogs on KDnuggets and Top 50 people to follow on Twitter by
IoT central for IoT.
World Economic Forum. Spoken at MWC (5 times), CEBIT, CTIA, Web 2.0, CNN,
BBC, Oxford Uni, Uni St Gallen, European Parliament. @feynlabs: teaching
kids Computer Science. Advisory: Connected Liverpool
www.opengardensblog.futuretext.com
Data Science for Internet of Things – practitioner course – March
2016
Now running in its second batch.
Welcome to the world’s first course that helps you to become a
Data Scientist for the Internet Of Things ..
The Big Picture – The Data Science and IoT landscape
[Figure: Internet of Things analytics architecture. Components shown on the edge device: edge processing engine, rules/workflow, CEP, alerts, triggers, actions. Components shown in the cloud / data lake: event collector, stream processing system (CEP, predictive alerts), event store, batch processing system on HDFS (build and validate the analytics model), event sequence, event-based analysis, CNN and RNN.]
As the term Internet of Things (IoT) implies, IoT is about smart objects.
For an object (say a chair) to be ‘smart’ it must have three things:
- An identity (to be uniquely identifiable, e.g. via IPv6)
- A communication mechanism (i.e. a radio) and
- A set of sensors / actuators
+
Physical context (e.g. location)
Social context
+
Decisions at the ‘edge’, e.g. with sensor fusion and even in offline mode
Workflow (IFTTT), often also at the edge
Thus, IoT is all about data.
IoT != M2M (M2M is a subset of IoT)
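The three requirements listed above can be sketched as a small data structure. This is only an illustration: the class name `SmartObject`, its fields and the example IPv6 addresses (taken from the 2001:db8::/32 documentation range) are assumptions made for this sketch, not part of any IoT standard.

```python
from dataclasses import dataclass, field
from ipaddress import IPv6Address


@dataclass
class SmartObject:
    """Hypothetical model of a 'smart' object as described above."""
    identity: IPv6Address                  # unique identity, e.g. via IPv6
    radio: str                             # communication mechanism ("" = none)
    sensors: list = field(default_factory=list)
    actuators: list = field(default_factory=list)
    location: str = ""                     # physical context (optional)

    def is_smart(self) -> bool:
        # Smart = identity + communication + at least one sensor or actuator.
        return bool(self.radio) and bool(self.sensors or self.actuators)


chair = SmartObject(
    identity=IPv6Address("2001:db8::1"),   # documentation-only address
    radio="BLE",
    sensors=["pressure"],
    location="meeting room",
)
plain_chair = SmartObject(identity=IPv6Address("2001:db8::2"), radio="")
```

Here `chair.is_smart()` holds because all three ingredients are present, while `plain_chair` has an identity but no radio or sensors.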
Many of the consumer IOT cases will happen with iBeacon in the next
two years
And 5G will provide the WAN connectivity. Source: Ericsson
[Figure: Closed-loop message-response system. Components shown: sensors; edge processor with rules/workflow; alerts, triggers and actions; cloud-based central repository with rules/workflow; analytic workbench (operational, investigative and predictive analytics and machine learning); possible specialized store; enterprise apps (ERP, CRM and other enterprise apps).]
Source: http://events.linuxfoundation.org/sites/events/files/slides/EdgeProcessing-allseenalliance_4x3_template_24sept2014.pdf
IoT relates to automation in three key areas based on sensing and predicting:
a) Move from exception handling to patterns of exceptions over time. (Are
some exceptions occurring repeatedly? Do I need to redesign my product? Is that a
new product?)
b) Move from optimization to disruption: from ownership to rental. (Where are all
these dynamic assets?)
c) Move to self learning. Robotics: from the assembly line to self-learning
robots (Boston Dynamics), autonomous helicopters.
Machines generate data: types of Big Data
Status data: Almost everything will have status data. This will create
vast amounts of data, much of which will be summarized at the ‘edge’.
Location data: Almost everything will have location data, even if that
location is static. Things will be in transit (where is my product, car, etc.).
Machines taking action: for example, a thermostat is automatically turned down.
Actionable data: Data in human-actionable form (workflow, IFTTT).
Machines learning by themselves in areas where there are no
‘rules’: the most interesting space. The best example is Deep Learning.
Data Science for IoT: The role of hardware in analytics
Processing at the edge (which Cisco and others have called Fog Computing).
Alternatively, we see entirely new classes of hardware specifically involved in
Data Science for IoT (such as the synapse chip for Deep Learning).
Different Data Formats
POS data
Social media
External feeds
Payments
Log data
Telephone conversations
RFID Scans
Events
Emails
Sensors
Free-form text
Geospatial
Audio
Still images/videos
Transactions
Call center notes
Adapted from Ravi Kalakota PhD
[Figure: IoT Reference Stack. Components shown: Portal, Dashboard, API Management; Event Processing and Analytics; Aggregation / Bus Layer (ESB and Message Broker); Communications (MQTT / HTTP / CoAP); Devices; Device Manager; Identity & Access Management. Also shown: Protocols, Standards, Governance, and the Industrial Internet and Consumer verticals: Smart Grid, Manufacturing, Logistics & Transportation, Robotics, Connected Car, Wearables, Health, Public Safety, Smart Cities, Retail.]
Multiple Protocols of IoT
Higher layer protocols
‒ Application: HTTP / REST, MQTT, CoAP, etc.
‒ Transport: TCP, UDP
‒ Network: IPv6, IPv6 with 6LoWPAN, etc.
‒ Link layer: Wireless (802.15.4, Wi-Fi, BLE, etc.)
What is Machine Learning?
Mitchell's Machine Learning
Tom Mitchell, in his book Machine Learning: “The field of machine learning is
concerned with the question of how to construct computer
programs that automatically improve with experience.”
Formally:
“A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.”
Think of it as a design tool where we need to understand:
What data to collect for the experience (E)
What decisions the software needs to make (T) and
How we will evaluate its results (P).
A programmer's perspective:
Machine Learning involves:
a) Training a model from data
b) Predicting / extrapolating a decision
c) Evaluating against a performance measure.
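The three steps of the programmer's perspective can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended model: the nearest-centroid classifier, the function names and all temperature readings are assumptions invented for this example.

```python
# Experience E: labelled sensor readings (temperature in °C, label).
# All values are made up for illustration.
train = [(20.0, "normal"), (22.0, "normal"), (24.0, "normal"),
         (35.0, "hot"), (38.0, "hot"), (40.0, "hot")]
held_out = [(21.0, "normal"), (37.0, "hot"), (26.0, "normal"), (33.0, "hot")]


def fit(samples):
    """a) Train: compute the mean reading (centroid) of each class."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, value):
    """b) Predict (task T): pick the class with the nearest centroid."""
    return min(model, key=lambda label: abs(model[label] - value))


def accuracy(model, samples):
    """c) Performance measure P: fraction of correct predictions."""
    hits = sum(predict(model, value) == label for value, label in samples)
    return hits / len(samples)


model = fit(train)
score = accuracy(model, held_out)
```

Adding more labelled readings to `train` is the "experience" in Mitchell's definition; whether the program has learned is judged only by how `score` changes on data it has not seen.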
Technique: Classification
Applicability: Most commonly used technique for predicting a specific outcome such as response / no-response, high / medium / low-value customer, likely to buy / not buy.
Algorithms:
- Logistic Regression: classic statistical technique, now available inside the Oracle Database; supports text and transactional data
- Naive Bayes: fast, simple, commonly applicable
- Support Vector Machine: next generation, supports text and wide data
- Decision Tree: popular, provides human-readable rules
Source: Oracle
Technique: Regression
Applicability: Technique for predicting a continuous numerical outcome such as customer lifetime value, house value, process yield rates.
Algorithms:
- Multiple Regression: classic statistical technique, now available inside the Oracle Database; supports text and transactional data
- Support Vector Machine: next generation, supports text and wide data
Technique: Attribute Importance
Applicability: Ranks attributes according to strength of relationship with the target attribute. Use cases include finding factors most associated with customers who respond to an offer, and factors most associated with healthy patients.
Algorithms:
- Minimum Description Length: considers each attribute as a simple predictive model of the target class
Source: Oracle
Technique: Anomaly Detection
Applicability: Identifies unusual or suspicious cases based on deviation from the norm. Common examples include health care fraud, expense report fraud, and tax compliance.
Algorithms:
- One-Class Support Vector Machine: trains on "normal" cases to flag unusual cases
Technique: Clustering
Applicability: Useful for exploring data and finding natural groupings. Members of a cluster are more like each other than they are like members of a different cluster. Common examples include finding new customer segments, and life sciences discovery.
Algorithms:
- Enhanced K-Means: supports text mining, hierarchical clustering, distance based
- Orthogonal Partitioning Clustering: hierarchical clustering, density based
- Expectation Maximization: clustering technique that performs well in mixed data (dense and sparse) data mining problems
Source: Oracle
Technique: Association
Applicability: Finds rules associated with frequently co-occurring items; used for market basket analysis, cross-sell, root cause analysis. Useful for product bundling, in-store placement, and defect analysis.
Algorithms:
- Apriori: industry standard for market basket analysis
Technique: Feature Selection and Extraction
Applicability: Produces new attributes as linear combinations of existing attributes. Applicable for text data, latent semantic analysis, data compression, data decomposition and projection, and pattern recognition.
Algorithms:
- Non-negative Matrix Factorization: next generation, maps the original data into the new set of attributes
- Principal Components Analysis (PCA): creates fewer, new composite attributes that represent all the attributes
- Singular Value Decomposition: established feature extraction method that has a wide range of applications
Source: Oracle
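To make one row of the table concrete, the sketch below illustrates the anomaly detection idea of training on "normal" cases and then flagging unusual ones. It deliberately swaps in a much simpler technique than the One-Class Support Vector Machine named in the table: a z-score threshold on a single sensor value. The readings, the function names and the threshold of three standard deviations are all assumptions for illustration.

```python
from statistics import mean, stdev


def fit_normal(readings):
    """Learn what 'normal' looks like: mean and sample standard deviation."""
    return mean(readings), stdev(readings)


def is_anomaly(value, mu, sigma, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations away."""
    return abs(value - mu) > threshold * sigma


# Hypothetical temperature readings recorded during normal operation.
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
mu, sigma = fit_normal(normal_readings)

# New readings to score; only the middle one is far from the norm.
flags = [is_anomaly(v, mu, sigma) for v in [20.0, 25.0, 19.9]]
```

A z-score only works for roughly bell-shaped, single-variable data; the algorithms in the table exist precisely because real IoT data is rarely that simple.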
KEY CONCEPTS – DATA SCIENCE AND IOT
Deep learning
Big Data
Complex event Processing
Streaming
In a groundbreaking paper published today in Nature, a team of
researchers led by DeepMind co-founder Demis Hassabis reported
developing a deep neural network that was able to learn to play such
games at an expert level. What makes this achievement all the more
impressive is that the program was not given any background
knowledge about the games. It just had access to the score and the
pixels on the screen.
It didn’t know about bats, balls, lasers or any of the other things we
humans need to know about in order to play the games.
But by playing lots and lots of games many times over, the computer
learnt first how to play, and then how to play well.
Deep Learning and Feature learning
Deep Learning can hence be seen as a more complete, hierarchical, ‘bottom-up’
approach to feature extraction that works without human intervention.
Source: ELEG 5040 Advanced Topics on Signal Processing (Introduction to
Deep Learning) by Xiaogang Wang
[Figure: Stream processing application. Elements shown: real-time feeds, stream processing application (memory), optional storage and queries (disk), alerts, actions.]
Source: The 8 Requirements of Real-Time Stream Processing
By Michael Stonebraker et al
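The idea of processing a real-time feed in memory and emitting alerts can be sketched with a sliding window. This is a toy illustration under stated assumptions: the class name `SlidingWindowMonitor`, the window size, the limit and the readings are all invented, and a real stream processing system adds distribution, fault tolerance and persistence.

```python
from collections import deque


class SlidingWindowMonitor:
    """Keeps the last `size` readings in memory and alerts on their average."""

    def __init__(self, size, limit):
        self.window = deque(maxlen=size)   # old readings fall out automatically
        self.limit = limit

    def process(self, reading):
        """Return an alert string if the full window's average exceeds the limit."""
        self.window.append(reading)
        average = sum(self.window) / len(self.window)
        window_full = len(self.window) == self.window.maxlen
        if window_full and average > self.limit:
            return f"ALERT: moving average {average:.1f} exceeds {self.limit}"
        return None


monitor = SlidingWindowMonitor(size=3, limit=30.0)
feed = [25.0, 28.0, 31.0, 33.0, 35.0]      # hypothetical real-time feed
alerts = [monitor.process(reading) for reading in feed]
```

Each reading is handled as it arrives and nothing is written to disk, which is the "process in memory, store optionally" pattern of the figure.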
[Figure: Kafka. Elements shown: producers (front ends, service), brokers, and consumers (Hadoop clusters, security systems, real-time monitoring, data warehouse, other consumer service).]
[Figure: Stream processing architecture based on Apache Spark. Elements shown: data sources, NoSQL, HDFS.]
Adapted from http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/
For example:
• Complex event processing involves combining the outputs of multiple
sensors and inferring events from readings even when the event is not
directly observed by a specific sensor. For complex event processing, we
also need to add statistical measures such as likelihood, confidence and
probability, using techniques like Bayesian networks, neural networks,
Dempster-Shafer methods, Kalman filters, etc. (e.g. a care home; image:
Guardian)
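Inferring an event that no single sensor observes directly can be sketched with a naive Bayes update, a much simpler relative of the Bayesian networks mentioned above. Everything here is an assumption for illustration: the care-home "fall" scenario, the sensor names, the probabilities, and the simplification that sensors fire independently of each other.

```python
def posterior(prior, likelihoods, observations):
    """Naive Bayes estimate of P(event | sensor observations).

    likelihoods[name]  = (P(sensor fires | event), P(sensor fires | no event))
    observations[name] = True if that sensor fired.
    Sensors are assumed conditionally independent (a simplification).
    """
    p_event, p_no_event = prior, 1.0 - prior
    for name, fired in observations.items():
        p_if_event, p_if_not = likelihoods[name]
        p_event *= p_if_event if fired else 1.0 - p_if_event
        p_no_event *= p_if_not if fired else 1.0 - p_if_not
    return p_event / (p_event + p_no_event)


# Hypothetical sensors: neither one observes a fall directly.
likelihoods = {
    "impact_sound": (0.9, 0.05),
    "no_motion": (0.8, 0.10),
}
prior_fall = 0.01

both = posterior(prior_fall, likelihoods,
                 {"impact_sound": True, "no_motion": True})
sound_only = posterior(prior_fall, likelihoods,
                       {"impact_sound": True, "no_motion": False})
```

With these made-up numbers, either sensor alone leaves a fall unlikely, while the two together push the estimate above one half, which is the point of combining sensor outputs.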
[Figure: Sensor fusion algorithm. Inputs: 3-axis earth magnetic field, 3-axis linear acceleration, 3-axis angular rate. Outputs: quaternions, heading, pitch, roll and yaw, linear acceleration, gravity.]
Source: ST microsystems
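A full sensor fusion algorithm combines all three inputs. The sketch below covers only two small pieces of the idea and is not the vendor's algorithm: estimating pitch and roll from the gravity direction seen by a static accelerometer, and a complementary filter that blends a gyroscope rate with that estimate. The axis convention (z pointing up when the device lies flat), the function names and the blending weight `alpha` are assumptions.

```python
import math


def tilt_from_gravity(ax, ay, az):
    """Return (pitch, roll) in degrees from a static accelerometer reading.

    Assumes the only acceleration present is gravity and that the z axis
    points up when the device lies flat.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return math.degrees(pitch), math.degrees(roll)


def complementary_filter(angle, gyro_rate, accel_angle, dt, alpha=0.98):
    """Blend the integrated gyro rate (smooth, but drifts) with the
    accelerometer angle (noisy, but drift-free). Angles in degrees,
    gyro_rate in degrees per second, dt in seconds."""
    return alpha * (angle + gyro_rate * dt) + (1.0 - alpha) * accel_angle


flat = tilt_from_gravity(0.0, 0.0, 1.0)       # device lying flat
on_side = tilt_from_gravity(0.0, 1.0, 0.0)    # device rolled onto its side
blended = complementary_filter(angle=0.0, gyro_rate=0.0,
                               accel_angle=10.0, dt=0.01)
```

Heading (yaw) cannot be recovered from gravity alone, which is one reason the pictured algorithm also takes the magnetic field as an input.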
Creating an open methodology for Internet of Things (IoT)
Analytics: Data science for Internet of Things
January 9, 2016, by Ajit
There is no specific methodology to solve Data Science for IoT (IoT
Analytics) problems.
This leads to some initial questions:
Should there be a distinct methodology to solve Data Science problems for
IoT?
Are IoT problems for Data Science unique enough to warrant a specific
approach?
What existing methodologies should we draw upon?
On the one hand, a Data Science for IoT problem is a typical Data Science
problem. On the other hand, there are some considerations unique to IoT,
for example the use of hardware, high data volumes, the use of
CEP (complex event processing), the impact of verticals (like automotive),
the impact of streaming data, etc.
Background and inspiration
Some initial background:
Data mining has well-known methodologies such as CRISP-DM. Hilary Mason
and others have also proposed specific methodologies for Data Science.
Kaggle problems have a specific approach to solving them. Techniques
like PFA (Portable Format for Analytics) provide a way of formalizing and
moving analytics models.
All these strategies also apply to IoT. IoT itself has methodologies like Ignite
IoT, but these do not cover IoT analytics in detail.
A methodology for IoT analytics (Data Science for IoT) should cover the
unique aspects of each step in Data Science. For example, it is more than
the choice of the model family. The choice of the model family (ANN, SVM,
trees, etc.) is only one of the many choices to make. Others include:
a) Choice of the model structure – optimisation methodology (CV,
Bootstrap, etc)
b) Choice of the model parameter optimisation algorithm (joint gradients
vs. conjugate gradients )
c) Preprocessing of the data (centring, reduction, functional reduction, log-
transform, etc.)
d) How to deal with missing data (case deletion, imputation, etc.)
e) How to detect and deal with suspect data (distance-based outlier
detection, density-based, etc.)
f) How to choose relevant features (filters, wrappers, embedded methods?)
g) How to measure prediction performances (mean square error, mean
absolute error, misclassification rate, lift, precision/recall, etc.)
Source: Methodology and standards for data analysis with machine learning
tools, Damien François
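Two of the choices in the list above, how to deal with missing data (d) and how to measure prediction performance (g), can be sketched as follows. The function names and all values are assumptions for illustration, and mean imputation is only one of the options the list mentions.

```python
from statistics import mean


def impute_mean(values):
    """d) Replace missing readings (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def mean_squared_error(actual, predicted):
    """g) Average squared difference; penalises large errors heavily."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def mean_absolute_error(actual, predicted):
    """g) Average absolute difference; less sensitive to single outliers."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


# Hypothetical sensor series with one dropped reading.
cleaned = impute_mean([20.0, None, 22.0, 24.0])

# Hypothetical predictions with one large miss.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

The same prediction errors give an MSE twice as large as the MAE here, which shows why the choice of performance measure is itself a methodological decision.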
The methodology could also cover -
Exploratory analysis of data
Hypothesis testing (“Given a sample and an apparent effect, what is the
probability of seeing such an effect by chance?” )
and other ideas ..
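The hypothesis-testing question quoted above ("what is the probability of seeing such an effect by chance?") can be answered directly for small samples with an exact permutation test. This is a sketch under assumptions: the function name and the two groups of readings are invented, and the approach enumerates every regrouping, so it is only practical for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means.

    Returns the fraction of all regroupings of the pooled data whose
    difference in means is at least as large as the observed one.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in indices if i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical readings from two device firmware versions.
p_separated = permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4])
p_identical = permutation_p_value([1, 2], [1, 2])
```

A small value such as `p_separated` says the observed gap would rarely arise from random grouping alone; a value of 1.0, as for the identical groups, says the data show no effect at all.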
Who?
Ajit Jaokar – futuretext
Jean-Jacques (JJ) Bernard, management & technology consultant
Shiva Soleimani, student, Isfahan University
Data Science for Internet of Things – practitioner course – March
2016
Now running in its second batch.
Welcome to the world’s first course that helps you to become a
Data Scientist for the Internet Of Things ..
Weekly schedule
Concepts
Week 0, Mar 15: Orientation, introductions, personal learning plans, platform signup
Week 1, Mar 21: Foundations: an analytics-driven organization; IoT and Machine Learning; Data Science for IoT: unique characteristics; Data Science for IoT: why now?
Mar 28: Machine Learning concepts; Deep Learning concepts
Apr 4: An introduction to IoT (Internet of Things)
Apr 11: IoT platforms: from sensor to cloud
Apr 18: Concepts of Big Data, part one
Apr 25: Concepts of Big Data, part two
May 2: Market drivers for IoT
May 9: Choosing a model: what technique to use?
May 16: Use cases and IoT datasets (these will continue throughout the course)
May 23: Time series and NoSQL databases
May 30: Streaming analytics, part one
June 6: Streaming analytics, part two
June 13: Deep learning, part one
June 20: Deep learning, part two
June 27: Machine learning algorithms, part one
July 4: Machine learning algorithms, part two
July 11: Mathematical foundations, part one
July 18: Mathematical foundations, part two
July to Dec 31: Project
Contact us at info@futuretext.com to sign up
Programming
Week 0, Mar 15: Orientation, introductions, personal learning plans, platform signup
Week 1, Mar 21:
Mar 28:
Apr 4: Intro to R, installations, basics of R
Apr 11:
Apr 18: Data frames in R and tabular data
Apr 25:
May 2: Data processing and data visualization in R
May 9:
May 16: Scala basics
May 23:
May 30: Spark batch processing I
June 6:
June 13: Spark batch processing II
June 20:
June 27: Spark SQL
July 4:
July 11: Spark Streaming
July 18:
July to Dec 31: Projects
A Reference Architecture for the Internet of Things
Daniel Karzel, Hannelore Marginean, Tuan-Si Tran
Adapted from the architecture defined by IoT-A
The IoT interconnects the Things in order to exchange information to fulfill
tasks for the users. Ideas of fridges communicating not only with your
smart-phone, but with the producer's server farm or an energy power plant
will soon become reality.
Terminology:
• Thing: An object of our everyday life placed in our everyday
environment. A thing can be a car, fridge but can also be abstracted to a
complete house or city depending on the use case.
• Device: A sensor, actuator or tag. Usually the device is part of a thing.
The thing processes the devices’ context information and communicates
selected information to other things. Furthermore, the thing can pass
actions to actuators.
• Interoperability and Integration components
• Context aware components
• Middleware components(load balancing etc)
• Security
Anind K. Dey’s context toolkit. The context toolkit was designed on an
application level, as it was designed for Geographical Information Systems
(GIS). In the IoT we have to extend the context toolkit towards the
intercommunication between things. However, the basic idea of goal,
context information and resulting actions remains in the IoT world.
In the IoT world we don’t only define the goal on the user level (i.e. by
application), but things themselves can work towards certain goals without
actively including the user. In the end the devices still serve the user but
they act autonomously in the background – which is exactly the idea
of ubiquitous computing.
Context defines the state of an environment (usually the user’s
environment) in a certain place at a certain time. The context model usually
distinguishes between context elements and context situation.
Context elements define specific context, usually on the device level. A
context element can be for example a temperature value at a certain time
and location.
Location and time are context elements themselves, but they play a special
role as they are needed to locate sensor values in space and time. Without
knowing where and when a temperature was measured the temperature
does not help much for making conclusions.
The context situation is an aggregation of context elements. The context
situation is thus a view on the environment in a certain location at a certain
time.
Similarly to the context model you can also define an action model that
defines what things can trigger (e.g. open a window, take a photo). Actions
can only be triggered with the combination of context information (e.g. a
context situation) and defined goals. Goals are usually depicted as rules of
a rule engine (e.g. IF temperature > 25° THEN open window).
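The relationship between context elements, a context situation, goals expressed as rules, and actions can be sketched as follows. This is a minimal illustration, not the reference architecture's implementation: the class names, field names, timestamps and readings are assumptions, and the single rule mirrors the temperature example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContextElement:
    """One measured value, located in space and time."""
    name: str
    value: float
    location: str
    timestamp: str


@dataclass
class Rule:
    """A goal expressed as IF condition(situation) THEN action."""
    condition: Callable[[dict], bool]
    action: str


def build_situation(elements, location):
    """Aggregate the context elements of one location into a context situation."""
    return {e.name: e.value for e in elements if e.location == location}


def evaluate(rules, situation):
    """Return the actions whose rule conditions hold in this situation."""
    return [rule.action for rule in rules if rule.condition(situation)]


# Hypothetical readings taken at the same time in two places.
elements = [
    ContextElement("temperature", 27.0, "office", "2016-03-15T10:00"),
    ContextElement("temperature", 19.0, "lobby", "2016-03-15T10:00"),
]
rules = [Rule(lambda s: s.get("temperature", 0.0) > 25, "open window")]

office_actions = evaluate(rules, build_situation(elements, "office"))
lobby_actions = evaluate(rules, build_situation(elements, "lobby"))
```

The same rule produces an action in the office and none in the lobby, which shows why location and time are needed before a temperature value can drive any decision.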
The architecture consists of six layers. Besides these layers there are two
“cross-section layers” that affect all other layers, namely “Security” and “Management”.
The device integration layer connects all the different device types,
consumes device measurements and communicates actions (on
device level). This layer can be seen as a translator that speaks many
languages. The output of the sensors and tags depends on the protocol
they implement. The input of the actuators is also defined by the protocol
they implement.
The device management is in charge of taking device registrations and
sensor measurements from the device integration layer. Furthermore, it
communicates status changes for actuators down to the device integration
layer. The device integration layer then just validates that the status change
(i.e. the action) conforms to the actuator and then translates the status
change for the actuator.
The data management can be seen as a central database that holds all
data of a “thing”, but this is only one possible implementation. For larger
things within the system (e.g. a device life-cycle monitoring system
collecting data from other things) data management might be a data
warehouse or even a complete data farm. The implementation of the data
management layer thus strongly depends on the use-case for the specific
thing.
The context management defines the central business logic and is
responsible for six tasks:
1. Define the goals of the thing.
2. Consume the context situation(s) of other things.
3. Produce the (own) context situation of the thing.
4. Evaluate the (own) context situation towards the goal.
5. Trigger actions that help to fulfill the goal according to the evaluated rules.
6. Publish context situations for other things.
According to these tasks we can divide the context management into eight
components as shown below.
Rule Engine & Artificial Intelligence (AI): Defines and manages all of the rules
necessary for context evaluation. This includes the goal (which is basically
a set of rules) as well as rules for creating the context situation and
actions.
Context Situation Integration Module: Listens to context situations of other
things and integrates the incoming context situations.
Action Integration Module: Incoming actions of other things are evaluated
and passed on to the device management layer by this component. Rules
have to be considered, that define in which situations an action received
from another thing can be passed on for triggering an actuator.
Context Situation Creator Module: Collects data from the system and builds
the context situation(s). This can also be driven by rules.
Action Creator Module: Similar to the context situation creator module,
action objects have to be created once triggered during rule evaluation.
Context Situation Publisher Module: Provide context situations to the thing
integration layer. According to the sophistication level of the implementation
the context situation publisher can provide a set of context situations for
different things that are subscribed or one context situation for everybody.
The context situation publisher module has to take care of data permission
levels towards other things. Only trusted other things should receive
selected context information. Furthermore this module has to take care of
defining the context situation schemas that are communicated to other
things that want to subscribe. The schema is used to evaluate whether a
thing is capable of communicating with another thing.
Action Publisher Module: Similar to the context situation publisher module,
this module is responsible for passing actions to the thing integration
layer, which communicates them to other things. Additionally, the action schema(s)
are managed by this component.
Context Evaluation Module: Evaluates the rules using the (current) context
situation and triggers actions that are communicated down to the devices or
to the action creator module. The action creator module in turn passes the
created actions to the action publisher that communicates the actions to
other things. One way to simply evaluate rules is to build decision trees
from the rules defined by the rule engine.
The concrete architecture and the complexity of the offered functionality strongly
depend on the use case for the thing under development. In particular, the
rule engine & artificial intelligence component might not have to be very
sophisticated for less intelligent things (e.g. a fridge). For things that collect
context information from other systems, these components will, however, be
very sophisticated. Higher sophistication can mean, for example, applying data science
and data mining techniques.
The thing integration layer is responsible for finding other things and
communicating with them.
Once two things have found each other, they have to undergo a registration
mechanism. The thing integration layer has to evaluate whether
communication with the thing to be partnered with is possible. For this
purpose the context situation and/or action schemata have to be compared.
These are provided by the context management layer.
If the schema match is evaluated positively, the thing can notify the other
thing whenever a new context situation or action is created. The context situations
and actions to be communicated to other things are provided by the context
management layer.
The thing registration can be done in a central component or by the thing
itself (e.g. auto-discovery network scan).
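The registration step described above, where schemata are compared before one thing may notify another, can be sketched as a simple publish/subscribe exchange. This is an illustrative toy, not the reference architecture's protocol: the class `Thing`, its attributes, and the representation of a schema as a mapping from field name to type name are all assumptions.

```python
class Thing:
    """Hypothetical thing that publishes context situations to subscribers."""

    def __init__(self, name, published_schema, required_fields):
        self.name = name
        self.published_schema = published_schema   # field name -> type name
        self.required_fields = required_fields     # what this thing needs
        self.subscribers = []
        self.inbox = []

    def register(self, other):
        """Accept `other` as a subscriber only if the schemata match, i.e.
        every field it requires is published here with the same type."""
        matches = all(self.published_schema.get(field_name) == type_name
                      for field_name, type_name in other.required_fields.items())
        if matches:
            self.subscribers.append(other)
        return matches

    def publish(self, situation):
        """Notify every registered subscriber of a new context situation."""
        for subscriber in self.subscribers:
            subscriber.inbox.append((self.name, situation))


thermostat = Thing("thermostat", {"temperature": "float", "room": "str"}, {})
window = Thing("window", {}, {"temperature": "float"})
fridge = Thing("fridge", {}, {"humidity": "float"})

accepted = thermostat.register(window)    # schema covers the window's needs
rejected = thermostat.register(fridge)    # thermostat publishes no humidity
thermostat.publish({"temperature": 27.0, "room": "office"})
```

Here the registration is handled by the thing itself; the text notes it could equally be done by a central component.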
The application integration layer connects the user to the thing.
Applications that are (directly) on top of the architecture are located here.
The application integration can be seen as a service layer, or even as a
simple UI on top of the stack. The concrete implementation of the layer
depends on the use case.