Autonomous Learning for Autonomous Systems
Prof Plamen Angelov, PhD, DSc, FIEEE, FIET
Vice President International Neural Network Society
School of Computing and Communications
Lancaster University, UK; e-mail: p.angelov@lancaster.ac.uk
Ottawa, 24/09/18
Outline
Part I
1. The new Data-rich Reality, Motivation, Challenges, Historical remarks
2. Case studies/applications
Part II
3. Under the bonnet: how we can achieve this
Context
In this talk the rationale and basics of EIS will be introduced while drawing close links with well-established methodologies and algorithms, e.g.:
✓ Machine learning
✓ Clustering
✓ Classification
✓ Adaptive systems
✓ Mining data streams
✓ Feature selection
✓ etc.
Context
➢ Huge volumes of streaming data
➢ True Intelligence is Evolving Intelligence
➢Social networks, Internet, IoT, banking
➢Advanced industrial processes, transport
➢Intelligence, surveillance, defence
Big Data (streams) – the new reality
• Although somewhat debatable as a term, we live in a new reality:
➢ GB of data in our pockets
➢ TB per day
➢ zettabytes (10²¹) of data and growing, exa-…
➢ streaming, heterogeneous, irregular sampling,…
➢ Moore's law now seems more applicable to digital data than to hardware
Offline is not an option
➢ Data streams cannot be analyzed in batch mode (storing & manipulating the complete data is impossible)
➢ Instead, systems have to be developed that extract knowledge from the data streams in real time, on-line, 'on the fly'
➢ For non-stationary data it is logical to assume a dynamic/evolving structure
➢ Evolving – gradually developing ► a higher level of adaptation
▪ Non-stationary data streams are characterised by shift and drift phenomena
▪ We would like to access our data and information anywhere, anytime
Specifics of Data Streams

Technical Challenges
✓ Extracting model structure – layered structure, deep learning
✓ Streaming data – recursive, non-iterative, parallelizable, collaborative algorithms, low memory and computational costs
✓ On-line down-selection of best inputs
✓ Even routine pre-processing (normalisation and standardisation) is not trivial when applied to data streaming on-line
Historical Remarks
✓ Evolving vs Evolutionary
✓ First Evolving Fuzzy/Connectionist/Self-organising NN/NF Systems (1998–2001; P. Angelov, N. Kasabov, C. T. Lin)
✓ First really dynamically evolving (2001–2004): eR (2001), eTS (2002), DENFIS (2002), simpl_eTS (2005),…
✓ EIS – first used by Angelov & Kasabov (2005, 2006)
✓ EIS (2010, Springer), ALS (2012)
Applications
1) Fast, transparent Deep Fuzzy Rule-based Learning for Image Classification
2) On-board Real-time Video Processing (detecting moving targets on the ground – AURORA project)
3) Self-calibrating sensors in oil refining (chemical and petrochemical industry)
4) High Frequency Trading (HFT)
Fast Interpretable Deep Learning
➢ After the training process, each ALMMo system generates 10 AnYa-type fuzzy rules (1 per class/digit).
Fast Interpretable Deep Learning
➢ Each ALMMo has 10 fuzzy rules, and each rule gives its output as a score of confidence based on the "winner takes all" principle.
Fast Interpretable Deep Learning

Approach | Accuracy | Training Time | PC Parameters | GPU Used | Elastic Distortion | Reproducibility | Parallelization
The Proposed Approach | 99.55% | Less than 2 minutes for each part of the network | Core i7-4790 (3.60 GHz), 16 GB DDR3 | None | NO | YES | YES
Large Convolutional Neural Networks [6] | 99.47% | – | – | – | NO | YES | NO
Committee of 35 Convolutional Neural Networks [4] | 99.77% | Almost 14 hours for each of the 35 DNNs | Core i7-920 (2.66 GHz), 12 GB DDR3 | 2 GTX 480 & 2 GTX 580 | YES | NO | NO
Fast Interpretable Deep Learning (Classification Stage)
➢ Every ALMMo system passes its scores of confidence corresponding to the 10 digits to the decision-making committee, and the committee integrates the outputs of all (154 per digit) into 10 overall scores of confidence.
➢ In general, the overall decision is made based on the "winner takes all" principle:
$$C_j^M(\mathrm{Image}) = \frac{1}{2}\Big(\max_{i=1,\dots,SR}\gamma_{i,j} + \frac{1}{SR}\sum_{i=1}^{SR}\gamma_{i,j}\Big) + \frac{1}{2}\Big(\max_{i=1,\dots,SR}\eta_{i,j} + \frac{1}{SR}\sum_{i=1}^{SR}\eta_{i,j}\Big)$$

$$\mathrm{Label} = \arg\max_{j=0,1,\dots,9} C_j^M(\mathrm{Image})$$
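The decision-making step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the per-subsystem scores are synthetic, and the equal-weight blend of maximum and mean per digit is an assumption; only the final "winner takes all" arg-max is stated explicitly in the slides.

```python
import numpy as np

def committee_decision(scores):
    """Integrate per-subsystem confidence scores (shape: SR x 10) into one
    overall score per digit by blending the maximum and the mean, then
    apply the "winner takes all" principle."""
    overall = 0.5 * scores.max(axis=0) + 0.5 * scores.mean(axis=0)
    return int(np.argmax(overall)), overall

# 3 sub-systems, 10 digit scores each; all sub-systems favour digit 7.
rng = np.random.RandomState(0)
scores = rng.rand(3, 10) * 0.2
scores[:, 7] = [0.90, 0.80, 0.95]
label, overall = committee_decision(scores)
print(label)  # 7
```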
A Massively Parallel Deep Rule-Based Ensemble Classifier for Remote Sensing Scenes
Semi-supervised Deep Rule-based Classifier using Deep Representation
• In this work, the pre-trained vgg-verydeep-16 convolutional neural network is used as the feature descriptor, and the 1×4096-dimensional activations from the first fully connected layer are used as the feature vector of the image.
DFRBL summary
• Fast deep learning;
• Rule-based (as opposed to a NN black box) – therefore, transparent;
• 0-order, prototype-based – therefore, interpretable;
• Fully parallelisable
• Non-iterative
• Non-parametric
• Evolving
• Human-level precision
• Fully repeatable
HFT problem description
• Due to fast changes on the global markets, individual agents
have to deal with a huge amount of HF financial data to:
– determine buy and sell recommendations;
– submit trades to exchanges for execution;
– manage trades after execution;
– analyse market data.
• Many HFT algorithms are based on technical analysis that
uses historical prices and indicators to identify zones of
supply/demand where sellers/buyers are likely to change
the price of the product.
HFT experiment
• Data stream: QuantQuote Second Resolution Market dB
• Attributes: 1) Time; 2) Open price; 3) High price; 4) Low price; 5) Close price; the target is the future value of 3).
• Five-steps-ahead prediction of the High price.
• [Figures: left – prediction; right – evolution of the model structure]
Results
• The time interval between 2 data ticks ranges from as little as a second to a few minutes; however, the proposed method needs less than 0.001 second on average to process each tick.
eSensor – self-calibrating
Final Rule base in the Abel inflammability test
R1: IF (P is 5.4%) AND (Tco is 323.3 °C) AND … AND (Tne is 126.8 °C) THEN (A1 = 20.2 + 92.7P + … + 0.12 Tne)
R2: IF (P is 11.7%) AND (Tco is 365.0 °C) AND … AND (Tne is 147.6 °C) THEN (A2 = 42.1 + 63.4P + … + 0.10 Tne)
R3: IF (P is 5.4%) AND (Tco is 335.14 °C) AND … AND (Tne is 136.1 °C) THEN (A3 = 25.2 + 71.9P + … + 0.19 Tne)
[Plots: Low end point of Naphtha; High end point of Naphtha]
Concepts Learning
✓ Traditional offline approaches, using a supervised train vs test vs validate data split, are not applicable
✓ Unsupervised concept learning from data distributions and realisations
✓ Data → Concepts → Relations and/or Rules
✓ Data Clouds (AnYa, 2010) – suitable for Concept representation
Model Frameworks
➢ One can use various mathematical frameworks to model big data, e.g. (but not limited to!):
✓ Probabilistic (HMM, Bayesian, PF, MC,…)
✓ Computational Intelligence (ANN, FRB,…)
✓ Decision Trees (DT)
✓ Grammar (Natural language, NL)
✓ EDA (Empirical Data Analytics, Angelov, 2014)
Principles
➢ Traditionally, system identification has mostly meant parameter identification (the system structure is selected, pre-defined!)
a) Divide et impera
b) Adapt/Evolve or die
c) Layered structure
Divide et Impera (Lat.)
Behind all framework types is the old Roman principle 'Divide et impera' – the data space is decomposed into Granules (states, pdf, neurons, rules, words, etc.).
Divide et impera
Decompose a complex process $y = f(x)$, $x \in R^n$, $y \in R^m$, into multiple simple local models which Adapt & Evolve.
[Figure: the overall output $y$ is a combination of $R$ local models (Local model 1, …, Local model R) centred at focal points $x_1^*, x_2^*, \dots$]
Local sub-Models
This leads to the concept of local sub-models/sub-systems, e.g.:
✓ States (HMM)
✓ Activation functions
✓ pdf
✓ Rules
✓ m, etc.
[Figure: HMM states $S_1, \dots, S_n, S_{new}$ with transition probabilities $p_{11}, p_{1n}, p_{nn}$]
Data Clouds vs Clusters
Having data clouds (areas for local models) we can now define the ALS as a set of rules of the following form:

$$Rule^i:\; \text{IF}\,(x \sim C^i)\; \text{THEN}\; LM^i$$

Table: A Comparison between Clusters and Data Clouds

Feature | Clusters | Data Clouds
Boundaries | Defined as hyper-ellipsoids | Voronoi tessellation
Centre/Prototype | Defined | Extracted post factum
Distance from a data point to the | Centre/Mean | Focal point
Membership function | Parameterised; approximation of an ideal distribution, assumed a priori | Non-parametric; reflects the real data distribution
Empirical Data Analytics
• Probability theory and statistics rest on a number of restrictive assumptions which usually do not hold in reality:
➢ pre-defined smooth, "convenient to use" types of distribution;
➢ an infinite amount of observations/data points;
➢ independence between data points (so-called iid – independent and identically distributed data);
➢ the pdf has a number of paradoxes
Empirical Data Analytics
➢ EDA is entirely based on the empirical observations of discrete data points and their mutual position, which forms a unique pattern in the data space
➢ An effective combination of frequency and spatial distance
EDA: Cumulative Proximity
Cumulative proximity is a measure indicating the degree of closeness/similarity of a particular data point to all other existing data points:

$$\pi_k(x_i) = \sum_{j=1}^{k} d^2(x_i, x_j), \qquad \pi_k(x_i) \ge 0,\; k > 1$$
EDA – basic measures, ε
2. Standardized Eccentricity – represents the association of the data point with the tail of the distribution and the property of being an outlier/anomaly:

$$\varepsilon_k(x_i) = k\,\xi_k(x_i) = \frac{2k\,\pi_k(x_i)}{\sum_{j=1}^{k}\pi_k(x_j)}, \qquad \varepsilon_k(x_i) \ge 0,\; k > 1$$

ε is very convenient for representing the well-known Chebyshev inequality, which turns into a simple check whether ε > 10 for n = 3, because:

$$P\big(\varepsilon_N(x) \ge n^2 + 1\big) \le \frac{1}{n^2}$$
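The two measures above are straightforward to compute directly. Below is a minimal sketch assuming Euclidean distance and a batch O(k²) computation for clarity (rather than the recursive form used on-line); the function names and the toy data are illustrative, not from the slides.

```python
import numpy as np

def cumulative_proximity(X):
    """pi_k(x_i) = sum_j d^2(x_i, x_j) for every point x_i (Euclidean)."""
    diffs = X[:, None, :] - X[None, :, :]
    return (diffs ** 2).sum(axis=2).sum(axis=1)

def standardized_eccentricity(X):
    """eps_k(x_i) = 2k * pi_k(x_i) / sum_j pi_k(x_j)."""
    pi = cumulative_proximity(X)
    return 2.0 * len(X) * pi / pi.sum()

# A tight cluster plus one far-away point.
X = np.vstack([np.random.RandomState(0).randn(50, 2) * 0.1, [[5.0, 5.0]]])
eps = standardized_eccentricity(X)
# Chebyshev-style check: eps > n^2 + 1 with n = 3 flags an anomaly.
outliers = np.where(eps > 10)[0]
print(outliers)  # the far-away point (index 50)
```

A useful sanity check on any implementation: the standardized eccentricities always sum to 2k.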
EDA – basic measures, D
3. Data density is inversely proportional to the standardized eccentricity.
It can be proven that for Euclidean and Mahalanobis distances D takes the form of a Cauchy function:

$$D_N(x_i) = \frac{1}{1 + \dfrac{\|x_i - \mu_N\|^2}{X_N - \|\mu_N\|^2}}$$

$$\mu_N = \frac{N-1}{N}\,\mu_{N-1} + \frac{1}{N}\,x_N, \qquad \mu_1 = x_1$$

$$X_N = \frac{N-1}{N}\,X_{N-1} + \frac{1}{N}\,x_N^T x_N, \qquad X_1 = x_1^T x_1$$
KDE and RDE
• The density in the data space is a key characteristic of anomalies and model structure (focal points for local sub-models).
• Traditionally – KDE, Parzen ('62), e.g. with a Gaussian kernel $K(x - x_i) = e^{-\|x - x_i\|^2}$:

$$D_k(x) = \frac{1}{k}\sum_{i=1}^{k} K(x - x_i)$$

• RDE – Cauchy type, Angelov ('02):

$$D_k(x_k) = \frac{1}{1 + \dfrac{1}{k}\displaystyle\sum_{i=1}^{k}\|x_k - x_i\|^2}$$
RDE (patented)
• It can be recursively updated (Angelov, 2008):

$$D_k(x_k) = \frac{1}{1 + \|x_k - \mu_k\|^2 + X_k - \|\mu_k\|^2}$$

$$\mu_k = \frac{k-1}{k}\,\mu_{k-1} + \frac{1}{k}\,x_k, \qquad \mu_1 = x_1$$

$$X_k = \frac{k-1}{k}\,X_{k-1} + \frac{1}{k}\,\|x_k\|^2, \qquad X_1 = \|x_1\|^2$$

• One can use RDE for:
a) Model structure update (Angelov, 2002);
b) Anomaly/outlier detection in RT (Angelov, 2007);
c) Typicality & Eccentricity (Angelov, 2014);
d) RTSDE (Angelov, 2014)
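The recursive form above needs only the running mean and the running average scalar product, so each new sample is processed in O(dim) time and memory. A minimal sketch of the update equations; the class name and toy stream are illustrative, not from the slides.

```python
import numpy as np

class RDE:
    """Recursive Density Estimation (Cauchy type): keeps only the mean mu_k
    and the average scalar product X_k, updated per incoming sample."""
    def __init__(self):
        self.k = 0
        self.mu = None   # recursive mean
        self.X = 0.0     # recursive average of ||x||^2

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.k += 1
        if self.k == 1:
            self.mu = x.copy()
            self.X = float(x @ x)
        else:
            a = (self.k - 1) / self.k
            self.mu = a * self.mu + x / self.k
            self.X = a * self.X + float(x @ x) / self.k
        # Cauchy-type density of the current sample
        var = self.X - float(self.mu @ self.mu)
        return 1.0 / (1.0 + float((x - self.mu) @ (x - self.mu)) + var)

rde = RDE()
stream = [np.array([0.1, 0.0]), np.array([0.0, 0.1]),
          np.array([0.1, 0.1]), np.array([5.0, 5.0])]  # last tick is anomalous
densities = [rde.update(x) for x in stream]
print(densities)  # the anomalous sample gets a much lower density
```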
EDA – basic measures, τ
4. Multi-modal (discrete global) typicality:

$$\tau_N^D(u_i) = \frac{f_i\, D_N(u_i)}{\sum_{j=1}^{L} f_j\, D_N(u_j)}, \qquad i = 1, \dots, L$$

where $u_1, \dots, u_L$ are the unique data points and $f_j$ their frequencies of occurrence.
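For a 1-D stream, the discrete multi-modal typicality can be computed by weighting a Cauchy-type data density at each unique value by its frequency and normalising. A sketch under those assumptions; the helper name and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def multimodal_typicality(stream):
    """Discrete multi-modal typicality: frequency-weighted density of each
    unique value, normalised so the typicalities sum to 1 (1-D case)."""
    counts = Counter(stream)
    uniques = np.array(sorted(counts))
    freqs = np.array([counts[u] for u in uniques], dtype=float)
    data = np.asarray(stream, dtype=float)
    mu = data.mean()
    var = np.square(data).mean() - mu ** 2
    dens = 1.0 / (1.0 + np.square(uniques - mu) / var)  # Cauchy-type density
    tau = freqs * dens / np.sum(freqs * dens)
    return uniques, tau

uniques, tau = multimodal_typicality([1, 1, 1, 2, 2, 3, 9])
print(dict(zip(uniques, np.round(tau, 3))))
```

The frequent, central value 1 receives a high typicality while the lone outlier 9 receives almost none.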
EDA – basic measures, τ
• Continuous global density and typicality:

$$\tau_N^G(x) = \frac{D_N^G(x)}{\int_{-\infty}^{\infty} D_N^G(x)\,dx}, \qquad D_N^G(x) = \sum_{i=1}^{C_N} S_{N,i}\, D_{N,i}^L(x)$$

where $C_N$ is the number of data clouds, $S_{N,i}$ their supports and $D_{N,i}^L$ the local Cauchy-type densities; the normalising integral has a closed form for the Cauchy function.
• Integrates to 1 and can be used as a pdf, but is derived entirely from the data with no prior assumptions
The EIS approach
Learning from experience (extract knowledge from data streams) throughout the whole life-cycle; reflects the human ability to acquire, summarize, and manage knowledge by learning "on the fly".
[Diagram: EIS life-cycle – data acquisition (hard sensors) → pre-process → RDE, detect outliers → model structure (data clouds) → learn parameters → predict/classify/control → EIS (evolve or die)]
The approach can be summarised as:
✓ Decomposition of the Data into Clouds (2010)
✓ Joint RT identification of local sub-models
✓ Overall output – a soft blend of local outputs
[Diagram: Input → Data → Model structure update based on R(TS)DE or m → Model evolution → Output(s)]
Evolving based on Density
▪ The core of ALS is to recursively estimate the data density from data streams and to react to its variations by modifying the model structure:
– A1) a data sample with high density is eligible to be a focal point of a Cloud/local sub-model
– A2) a data sample that lies in an area of the data space not covered by other local sub-models is also eligible to form a new local sub-model
– B) avoid overlap and information redundancy when forming new local sub-models
– C) remove old clouds and those with low support & utility
– D) select on-line the input variables that contribute most to the output
Local models weighting based on relative density
The (fuzzy) weight $\lambda_k^i$ of a cloud is described by the relative local density:

$$\lambda_k^i = \frac{D_k^i}{\sum_{j=1}^{N} D_k^j}, \qquad \sum_{i=1}^{N}\lambda_k^i = 1, \qquad 0 \le \lambda_k^i \le 1$$

[Figure: a new sample with densities $D_{new}^1$, $D_{new}^2$ relative to clouds G1 and G2 in the x–y data space]
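A sketch of this relative-density weighting, assuming Cauchy-type local densities around each focal point. The per-cloud focal points and spread values are illustrative stand-ins for the quantities an evolving system would maintain on-line.

```python
import numpy as np

def cloud_weights(x, focal_points, spreads):
    """Fuzzy weight of each data cloud: local Cauchy-type density of the
    sample around each focal point, normalised to sum to 1."""
    x = np.asarray(x, dtype=float)
    dens = np.array([1.0 / (1.0 + np.sum((x - f) ** 2) / s)
                     for f, s in zip(focal_points, spreads)])
    return dens / dens.sum()

# Two clouds; the sample sits close to the first focal point.
lam = cloud_weights([0.1, 0.1],
                    focal_points=[np.array([0.0, 0.0]), np.array([3.0, 3.0])],
                    spreads=[1.0, 1.0])
print(lam)  # weights sum to 1, first cloud dominates
```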
Learning local model parameters with proven convergence/stability
➢ The error $\epsilon_k$ caused by approximating $f(x)$ is bounded by the constant $\varepsilon$:

$$y_k = H_k^T Q^* + \epsilon_k, \qquad |\epsilon_k| \le \varepsilon$$

➢ fwRLS – globally optimal, stable/converging solution:

$$Q_{k+1} = Q_k + \alpha_k \Sigma_k H_k e_k$$
$$\Sigma_{k+1} = \Sigma_k - \alpha_k \Sigma_k H_k H_k^T \Sigma_k$$

➢ $\alpha_k$ is the time-varying (but bounded) learning rate given by:

$$\alpha_k = \frac{1}{1 + H_k^T \Sigma_k H_k}$$
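The update equations above translate directly into code. A minimal sketch recovering a known linear model from a synthetic, noise-free stream; the initialisation (Q = 0, Σ = 100·I) is an illustrative choice, not prescribed by the slides.

```python
import numpy as np

def rls_step(Q, Sigma, H, y):
    """One fwRLS-style update. Q: parameter estimate, Sigma: covariance,
    H: regressor vector, y: observed output."""
    e = y - H @ Q                          # prediction error e_k
    alpha = 1.0 / (1.0 + H @ Sigma @ H)    # bounded learning rate alpha_k
    Q = Q + alpha * (Sigma @ H) * e
    Sigma = Sigma - alpha * np.outer(Sigma @ H, H) @ Sigma
    return Q, Sigma

# Recover y = 2*x1 - 1*x2 from a stream of noise-free samples.
rng = np.random.RandomState(1)
Q, Sigma = np.zeros(2), np.eye(2) * 100.0
for _ in range(200):
    H = rng.randn(2)
    y = H @ np.array([2.0, -1.0])
    Q, Sigma = rls_step(Q, Sigma, H, y)
print(np.round(Q, 3))  # close to [2, -1]
```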
EIS as FRB systems

$$Rule^1:\; \text{IF}\,(x \text{ is } Granule^1)\; \text{THEN}\; \Big(y_1^1 = a_{01}^1 + \sum_{j=1}^{n} a_{j1}^1 x_j\Big)\; \text{AND} \dots \text{AND}\; \Big(y_1^m = a_{01}^m + \sum_{j=1}^{n} a_{j1}^m x_j\Big)$$
..........................................................................................
$$Rule^R:\; \text{IF}\,(x \text{ is } Granule^R)\; \text{THEN}\; \Big(y_R^1 = a_{0R}^1 + \sum_{j=1}^{n} a_{jR}^1 x_j\Big)\; \text{AND} \dots \text{AND}\; \Big(y_R^m = a_{0R}^m + \sum_{j=1}^{n} a_{jR}^m x_j\Big)$$

Rules are added and removed on-line; the input variables are also selected on-line.
Prediction, Classification or Control
The problem formulation is generic. It applies to:
• Anomaly/outlier detection (low D)
• Clustering (focal points with high D)
• Classification:

$$Rule^i:\; \text{IF}\,(x \sim C^i)\; \text{THEN}\,(x \to Label^i), \qquad Label = \arg\max_{j=1,\dots,N} \lambda_j$$

• Prediction and Control:

$$y(k+1) = \sum_{i=1}^{N} \lambda_i\, LM^i$$
Analyze rule quality
– Support – the number of samples associated with the rule:

$$S_l \leftarrow S_l + 1; \qquad l = \arg\min_{i=1,\dots,N} \|z_k - z_i^*\|$$

$$I_l^k \leftarrow k; \qquad l = \arg\min_{i=1,\dots,N} \|z_k - z_i^*\|, \qquad l \in [1, N]$$

– Age:

$$Age^i(k) = k - \frac{1}{N^i(k)}\sum_{l=1}^{N^i(k)} I_l^i, \qquad i \in [1, R]$$
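The support and age bookkeeping can be sketched as follows: each incoming sample is assigned to the nearest focal point, its arrival time is recorded, and the age of a rule is the current time minus the mean arrival time of its samples. The class name and toy data are illustrative, not from the slides.

```python
import numpy as np

class RuleStats:
    """Track support and age of rules/clouds defined by focal points."""
    def __init__(self, focal_points):
        self.focal = [np.asarray(f, dtype=float) for f in focal_points]
        self.support = [0] * len(focal_points)
        self.times = [[] for _ in focal_points]  # arrival times per rule
        self.k = 0

    def update(self, z):
        self.k += 1
        z = np.asarray(z, dtype=float)
        l = int(np.argmin([np.linalg.norm(z - f) for f in self.focal]))
        self.support[l] += 1
        self.times[l].append(self.k)

    def age(self, i):
        """Age^i(k) = k - mean arrival time of the rule's samples."""
        if not self.times[i]:
            return float(self.k)
        return self.k - float(np.mean(self.times[i]))

stats = RuleStats([[0.0, 0.0], [5.0, 5.0]])
for z in [[0.1, 0.0], [4.9, 5.1], [0.0, 0.2], [0.1, 0.1]]:
    stats.update(z)
print(stats.support, [stats.age(0), stats.age(1)])
```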
Detecting shift by rule age
▪ A shift in the data stream can be detected by the age of the rule: it corresponds to the inflexion point of the Age curve, where the derivative of the Age changes its sign:

$$\frac{d\,Age}{dk}$$
Rule utility
▪ The utility of a fuzzy rule represents the accumulated strength of that fuzzy rule
▪ It is defined as the accumulated firing level of the respective fuzzy rule over the span of its life:

$$U_k^i = \frac{1}{k - t^{i}}\sum_{l=1}^{k}\lambda_l^i, \qquad i \in [1, R]$$

where $t^{i}$ is the time instant when rule $i$ was created.
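Utility is simply the accumulated firing level divided by the rule's life span, which makes low-utility rules easy to spot and remove. A tiny sketch with illustrative numbers.

```python
import numpy as np

def rule_utility(firing_levels, created_at, k):
    """Utility of one rule: accumulated firing level divided by the rule's
    life span (current time k minus creation time)."""
    return float(np.sum(firing_levels)) / (k - created_at)

# A rule created at k=10 that has fired weakly since then has a low
# utility and becomes a candidate for removal.
lam = [0.9, 0.8, 0.05, 0.02, 0.01]  # firing levels at k = 11..15
print(rule_utility(lam, created_at=10, k=15))  # (0.9+0.8+0.05+0.02+0.01)/5
```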
On-line inputs selection
▪ Because the sub-models are locally linear, the sensitivity analysis reduces to analysis of the consequent parameters.
▪ The importance of each input (feature) can be evaluated by the ratio of the accumulated sum of the consequent parameters for the specific jth input (feature) with respect to all n inputs:

$$\pi_{ij}(k) = \frac{\omega_{ij}(k)}{\sum_{r=1}^{n}\omega_{ir}(k)}, \qquad \omega_{ij}(k) = \sum_{l=1}^{k}\big|a_{ij}(l)\big|, \qquad i \in [1,R],\; j \in [1,n]$$

▪ An input $j^*$ whose contribution is much smaller than the average, $\pi_{ij^*}(k) \ll \frac{1}{n}\sum_{r=1}^{n}\pi_{ir}(k)$, can be removed on-line.
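The accumulated-parameter ratio can be sketched as below. The 10% threshold for "much smaller than the average contribution 1/n" is an illustrative choice, as the slides do not fix a value; the data are synthetic.

```python
import numpy as np

def input_importance(param_history):
    """Per-input importance for one locally linear sub-model: accumulated
    |consequent parameter| per input, normalised across all n inputs."""
    omega = np.abs(np.asarray(param_history, dtype=float)).sum(axis=0)
    return omega / omega.sum()

# Consequent parameters over 4 updates; the 3rd input barely contributes.
history = [[2.0, -1.0, 0.01],
           [2.1, -0.9, 0.02],
           [1.9, -1.1, 0.01],
           [2.0, -1.0, 0.00]]
pi = input_importance(history)
n = len(pi)
drop = np.where(pi < 0.1 / n)[0]  # far below the average contribution 1/n
print(np.round(pi, 3), drop)      # input index 2 is a removal candidate
```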
Researchers
• Gruffydd Morris (2005-now), finishing PhD
• Dr. Xiaowei Zhou (2005-2008), now CEO, China
• Dr. Ramin Ramezani (2007-9), now UCLA
• Dr. Pouria Sadeghi-Tehran (2009-2015), now RI
• Dr. Javier Andreu (2008-2012), now Imperial College
• Tu Vong (2011-2012), now ARM, Cambridge
• Chris Clarke (2013-2014), now doing PhD here
• Antoniou Antreas (2014-2015), now Amazon
• Many other MSc students
Publications
• P. Angelov, Autonomous Learning Systems: From Data to Knowledge in Real Time, Wiley, 2012.
• P. Angelov, X. Gu, J. Principe, A generalized methodology for data analysis, IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2017.2753880, 2017.
• P. Angelov et al., Empirical data analytics, Int. J. of Intelligent Systems, DOI: 10.1002/int.21899, 2017.
• P. Angelov, X. Gu, J. Principe, Autonomous learning multi-model systems from data streams, IEEE Transactions on Fuzzy Systems, DOI: 10.1109/TFUZZ.2017.2769039, 2017.
• P. Angelov, X. Gu, Empirical Fuzzy Sets, Int. J. of Intelligent Systems, DOI: 10.1002/int.21935, 2017.
• P. Angelov, X. Gu, MICE: Multi-layer multi-model images classifier ensemble, IEEE Intern. Conference on Cybernetics (CYBCONF), Exeter, UK, 2017, pp. 1-8.
• P. Angelov, X. Gu, A Cascade of Deep Learning Fuzzy Rule-based Image Classifier and SVM, IEEE Intern. Conf. on SMC (SMC2017), Banff, Canada, 2017.
• P. Angelov et al., Fast feedforward non-parametric deep learning network with automatic feature extraction, IJCNN-2017, Anchorage, USA, pp. 534-541.
Highlights
– True Intelligence is Dynamically Evolving
– Evolving Model Structure
– HFT and Evolving Fast Transparent Deep
Learning classifier applications of EIS