Artificial Neural Networks (ANNs) and Deep Learning have taken the data science scene by storm. Most of the current results and demonstrators, however, are centered on image, text, and voice processing. Yet these techniques are extremely powerful and generic machine learning algorithms that can also be applied in more traditional domains such as cyber security and personalized marketing. In this webcast, I will introduce and de-hype deep learning, present a number of ANN patterns and the problems that can be solved by training those models on the available data, and briefly describe the available libraries and how to speed up the learning process with GPUs and distributed computing.
2. Data Scientist, Big and Fast Data Architect
Currently at Teradata
Previously:
Enterprise Data Architect at ING
Senior Researcher at Philips Research
Interests:
Spark, Flink, Cassandra, Akka, Kafka, Mesos
Anomaly Detection, Time Series, Deep Learning
3. Data Science: approaches
Supervised:
- you know what the outcome must be
Unsupervised:
- you don’t know what the outcome must be
Semi-Supervised:
- You know the outcome only for some samples
4. Popularity of Neural Networks: “The cat neuron”
Andrew Ng, Jeff Dean et al:
1000 Machines
10 Million images
1 Billion connections
Train for 3 days
http://research.google.com/archive/unsupervised_icml2012.html
5. Popularity of Neural Networks: “AI at facebook”
Yann LeCun
Director of AI research at Facebook
Ask the AI what it sees in the image
“Is there a baby?”
Facebook’s AI: “Yes.”
“What is the man doing?”
Facebook’s AI: “Typing.”
“Is the baby sitting on his lap?”
Facebook’s AI: “Yes.”
http://www.wired.com/2015/11/heres-how-smart-facebooks-ai-has-become/
6. Data Science: approaches
Supervised:
- you know what the outcome must be
Unsupervised:
- you don’t know what the outcome must be
Semi-Supervised:
- You know the outcome only for some samples
7. Unsupervised Learning
- Clustering, Feature extraction
Imaging, Medical data, Genetics, Crime patterns,
Recommender systems, Climate hot-spot analysis, Anomaly detection
… Given a set of items,
it answers the question “how can we efficiently describe the collection?”
It defines a measure of “similarity” between items.
8. Supervised Learning
- Classification
Marketing Churn, Credit Loan, Success rate
Insurance Defaulting, Health conditions and pathologies
Categorization of wine, real estate,
… Given the values of some properties,
it answers the question “to which class/group does this item belong?”
9. Classification: Dimensionality matters
- Number of dimensions or features of your input data
- Statistical relations, smoothness of the data
- Embedded space
Example: 28x28-pixel images → input: 784 dimensions, output: 10 classes
Example: Iris flowers → input: 4 dimensions, output: 3 classes
10. AI, complexity and models
Decision flow for building models:
- Does it do well on the training data? If not → bigger neural network (the rocket engine).
- Does it do well on the test data? If not → more data (the rocket fuel).
- If neither fix helps → different architecture (a new rocket).
- Yes to both → done.
https://www.youtube.com/watch?v=CLDisFuDnog
11. Evolution of Machine Learning
Rule-based System: Input → Hand-designed Program → Output
Prof. Yoshua Bengio - Deep Learning
https://youtu.be/15h6MeikZNg
12. Evolution of Machine Learning
Rule-based System: Input → Hand-designed Program → Output
Classic Machine Learning: Input → Hand-designed Features → Mapping from Features → Output
Prof. Yoshua Bengio - Deep Learning
https://youtu.be/15h6MeikZNg
13. Evolution of Machine Learning
Rule-based System: Input → Hand-designed Program → Output
Classic Machine Learning: Input → Hand-designed Features → Mapping from Features → Output
Representational Machine Learning: Input → Learned Features → Mapping from Features → Output
Prof. Yoshua Bengio - Deep Learning
https://youtu.be/15h6MeikZNg
14. Evolution of Machine Learning
Rule-based System: Input → Hand-designed Program → Output
Classic Machine Learning: Input → Hand-designed Features → Mapping from Features → Output
Representational Machine Learning: Input → Learned Features → Mapping from Features → Output
Deep Learning: Input → Learned Features → Learned Complex Features → Mapping from Features → Output
Prof. Yoshua Bengio - Deep Learning
https://youtu.be/15h6MeikZNg
16. Logit model: Perceptron
1-Layer Neural Network
Takes n input features x1, x2, …, xn, computes their weighted sum ∑, and applies an activation function f, mapping them to a soft “binary” space.
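This unit can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the talk: the weight vector `w` and bias `b` are assumed inputs, and the sigmoid is used as the activation f.

```python
import numpy as np

def perceptron(x, w, b):
    """One 'logit' unit: weighted sum of the n input features,
    squashed by a sigmoid into the soft 'binary' space (0, 1)."""
    z = np.dot(w, x) + b             # the ∑ step
    return 1.0 / (1.0 + np.exp(-z))  # the f step (sigmoid activation)
```

With weights that cancel the inputs exactly (z = 0), the output sits at 0.5, the midpoint of the soft binary space.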
17. Multiple classes: Softmax
From the soft binary space to predicted probabilities:
take n inputs, exponentiate each one, and divide by the sum of the exponentials.
softmax → Cat: 95%, Dog: 5%
Values between 0 and 1
Sum of all outcomes = 1
It behaves like a probability,
But it’s just an estimate!
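The softmax step can be sketched as follows; a minimal version, with the max-shift added for numerical stability (a common convention, not from the slide):

```python
import numpy as np

def softmax(z):
    """Turn n raw scores into values in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()
```

Note the outputs behave like probabilities (positive, summing to 1) but are only estimates produced by the model.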
18. Cost function: Supervised Learning
The actual outcome differs from the desired outcome.
We measure the difference!
This measurement can be done in various ways:
- Mean absolute error (MAE)
- Mean squared error (MSE)
- Categorical Cross-Entropy
Compares the estimated probability distribution with the actual one
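Two of these cost measures can be sketched directly; the small `eps` guard against log(0) is an added convention, not from the slide:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between desired and actual outcome."""
    return np.mean((y_true - y_pred) ** 2)

def categorical_cross_entropy(p_true, p_pred, eps=1e-12):
    """Compare estimated class probabilities with the true (one-hot) ones."""
    return -np.sum(p_true * np.log(p_pred + eps))
```

A perfect prediction gives zero cost; an uncertain 50/50 estimate against a one-hot target gives a cross-entropy of log 2.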
19. Minimize cost: How to Learn?
The cost function depends on:
- Parameters of the model
- How the model “composes”
Goal :
modify the parameters to reduce the error!
Vintage math from last century
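That vintage math boils down to gradient descent: step the parameters against the gradient of the cost. A minimal one-dimensional sketch (the learning rate, step count, and example cost function are illustrative choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the cost."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize cost(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The parameter converges to x = 3, the minimum of the cost; the same idea applies to all the weights of a network at once.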
20. Build deeper networks
Stack layers of perceptrons: a “Sequential Network”
- Feed-forward: the input parameters flow through the stacked layers to a SOFTMAX layer, which outputs the classes (estimated probabilities).
- Supervised: a cost function compares the estimate with the actual output.
- Back-propagate the error to correct the parameters.
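The feed-forward pass through such a stack can be sketched as follows. The layer sizes and random weights here are hypothetical, and training (back-propagation) is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def feed_forward(x, params):
    """Sequential network: input → hidden perceptron layer → softmax classes.
    `params` holds each layer's (weights, bias) pair."""
    (W1, b1), (W2, b2) = params
    h = sigmoid(W1 @ x + b1)     # hidden layer of perceptrons
    return softmax(W2 @ h + b2)  # estimated class probabilities

rng = np.random.default_rng(0)
params = [(rng.normal(size=(5, 4)), np.zeros(5)),   # 4 inputs → 5 hidden
          (rng.normal(size=(3, 5)), np.zeros(3))]   # 5 hidden → 3 classes
probs = feed_forward(rng.normal(size=4), params)
```

Training would compare `probs` with the actual output via the cost function and back-propagate the error through both layers to correct `params`.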
21. Some problems
- Calculating the derivative of the Cost function
- can be error prone
- Automation would be nice!
- Complex network graph = complex derivative
- Dense Layers (Fully connected)
- Harder to converge
- Number of parameters grows fast!
- Overfitting and Parsimony
- Learn “well”, generalization capacity
- Be efficient in the number of parameters
22. Some Solutions
- Calculating the derivative of the Cost function
- Software libraries
- GPU support for computing vectorial and tensorial data
- New Layers Types
- Convolution Layers 2D/3D
- Dropout layer
- Fast activation functions
- Faster learning methods
- Derived from Stochastic Gradient Descent (SGD)
- Weight initializations with Auto-Encoders and RBM
23. Convolutional Networks
Idea 1: reuse the same weights while scanning across the image
Idea 2: subsample the results from layer to layer
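Both ideas can be sketched in one dimension; the kernel values and the max-pooling style of subsampling are illustrative assumptions:

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Idea 1: slide the same small weight kernel across the input,
    so the weights are reused at every position (weight sharing)."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

def subsample(x, factor=2):
    """Idea 2: max-pooling style subsampling, keep the max of each window."""
    return np.array([x[i:i + factor].max()
                     for i in range(0, len(x) - factor + 1, factor)])
```

In 2D the same two operations scan image patches instead of signal windows, which is what keeps the parameter count of convolutional layers small.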
24. Fast Activation Functions
Idea: don’t use complex exponential functions;
piecewise-linear functions are fast to compute and easy to differentiate!
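A sketch of such a function, assuming the standard ReLU (named here as one common choice, not taken from the slide):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: no exponentials, just a max with zero."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Its derivative is trivially 0 or 1."""
    return (z > 0).astype(float)
```

Compare this with the sigmoid, which needs an exponential in both the forward pass and its derivative.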
25. Dropout Layer, Batch Weight Normalization
Dropout:
Randomly set some of the inputs to zero.
This improves generalization and makes the network function more robust to errors.
Batch Weight Normalization:
Normalize the activations of the previous layer at each batch.
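Dropout can be sketched as a random mask. The rescaling by 1/(1-rate) is the common "inverted dropout" convention, an assumption not stated on the slide:

```python
import numpy as np

def dropout(x, rate, rng):
    """Randomly zero each input with probability `rate`; scale the
    survivors so the expected sum of activations is unchanged."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At test time dropout is switched off and the full activations are used, which is why the training-time rescaling matters.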
26. Efficient Symbolic Differentiation
There are good libraries which calculate the derivatives symbolically of an
arbitrary number of stacked layers
● efficient symbolic differentiation
● dynamic C code generation
● transparent use of a GPU
Examples: Theano, TensorFlow, CNTK
27. Efficient Symbolic Differentiation (2)
There are good libraries which calculate the derivatives symbolically of an
arbitrary number of stacked layers
● efficient symbolic differentiation
● dynamic C code generation
● transparent use of a GPU
>>> import theano
>>> import theano.tensor as T
>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> f = theano.function([x], gy)
>>> pp(f.maker.fgraph.outputs[0])
'(2.0 * x)'
28. Higher Abstraction Layer: Keras
Keras: Deep Learning library for Theano and TensorFlow
- Easier to stack layers
- Easier to train and test
- More ready-made blocks
http://keras.io/
29. Example 1: Iris classification
Categorize Iris flowers based on
- Sepal length/width
- Petal length/width
3 classes,
Dataset is quite small (150 samples)
- Iris Setosa
- Iris Versicolour
- Iris Virginica
input : 4 dimensions
output: 3 classes
31. Example 2: telecom customer marketing
Semi-synthetic dataset
The "churn" data set was developed to predict telecom customer churn based on information about their account. The data files state that the data are "artificial, based on claims similar to real world". These data are also contained in the C50 R package.
1 output class (churn)
Dataset is quite small (about 3000 samples)
17 input dimensions:
state, account length, area code, phone number, international plan, voice mail plan, number vmail messages, total day minutes, total day calls, total day charge, total eve minutes, total eve calls, total eve charge, total night minutes, total night calls, total night charge, total intl minutes, total intl calls, total intl charge, number customer service calls
33. Models: Small Data, Big Data
- Not all domains have large amounts of data
- Think of Clinical Tests, or Lengthy/Costly Experimentations
- Small specialized data set and Neural Networks
- Good for complex non-linear separation of classes
Interesting Read:
https://medium.com/@ShaliniAnanda1/an-open-letter-to-yann-lecun-22b244fc0a5a#.ngpal1ojx
34. Conclusions
- Neural Networks can be used for small data as well
- Other methods might be more efficient in these scenarios
- Neural Networks are an extension to GLMs and linear regression
- Learn Linear Regression, GLM, SVM as well
- Random Forests and Boosted Trees are an alternative
- More data = Bigger and better Neural Networks
- We have some tools to jump start analysis