Deep Learning

Deep Learning taught in UCL Information Retrieval and Data Mining MSc course.

  1. 1. Deep Learning Dr. Jun Wang Computer Science, UCL As part of COMPGI15/M052 Information Retrieval and Data Mining (IRDM)
  2. 2. AlphaGo vs. the world’s ‘Go’ champion Coulom, Rémi. "Whole-history rating: A Bayesian rating system for players of time-varying strength." Computers and games. Springer Berlin Heidelberg, 2008. 113-124. http://www.goratings.org/
  3. 3. Deep Learning helps Speech Translation http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html http://research.microsoft.com/en-us/about/speech-to-speech-milestones.aspx
  4. 4. Deep Learning helps Computer Vision http://blogs.technet.com/b/inside_microsoft_research/archive/2015/02/10/microsoft- researchers-algorithm-sets-imagenet-challenge-milestone.aspx
  5. 5. Leaderboard (ImageNet image classification), from Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
Image classification, top entries for ILSVRC 2012-2014 (error %, with 99.9% bootstrap confidence interval; † indicates entries using additional training data):
2014 GoogLeNet 6.66 (6.40-6.92); 2014 VGG 7.32 (7.05-7.60); 2014 MSRA 8.06 (7.78-8.34); 2014 AHoward 8.11 (7.83-8.39); 2014 DeeperVision 9.51 (9.21-9.82); 2013 Clarifai† 11.20 (10.87-11.53); 2014 CASIAWS† 11.36 (11.03-11.69); 2014 Trimps† 11.46 (11.13-11.80); 2014 Adobe† 11.58 (11.25-11.91); 2013 Clarifai 11.74 (11.41-12.08); 2013 NUS 12.95 (12.60-13.30); 2013 ZF 13.51 (13.14-13.87); 2013 AHoward 13.55 (13.20-13.91); 2013 OverFeat 14.18 (13.83-14.54); 2014 Orange† 14.80 (14.43-15.17); 2012 SuperVision† 15.32 (14.94-15.69); 2012 SuperVision 16.42 (16.04-16.80); 2012 ISI 26.17 (25.71-26.65); 2012 VGG 26.98 (26.53-27.43); 2012 XRCE 27.06 (26.60-27.52); 2012 UvA 29.58 (29.09-30.04).
Single-object localization: 2014 VGG 25.32 (24.87-25.78); 2014 GoogLeNet 26.44 (25.98-26.92).
The winning methods are statistically significantly different from the other entries, even at the 99.9% level.
Unofficial human error is around 5.1% on a subset. Why is there still a human error? When labeling, human raters judged whether an image belongs to a given class (a binary decision), whereas the challenge is a 1000-class classification problem.
2015: ResNet (ILSVRC'15) reached 3.57. Network depth: U. of Toronto SuperVision, a 7-layer network; GoogLeNet, a 22-layer network; Microsoft ResNet, a 152-layer network.
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  6. 6. Deep Learning helps Natural Language Processing https://code.google.com/p/word2vec/ From bag of words to distributed representations of words that capture the semantic meaning of a word. (Figure: visualization of regularities in word vector space, showing gendered pairs such as king/queen, he/she, prince/princess, hero/heroine, actor/actress, landlord/landlady, male/female, bull/cow, cock/hen.)
Compositionality by vector addition (expression: nearest tokens):
Czech + currency: koruna, Czech crown, Polish zloty, CTK
Vietnam + capital: Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines: airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river: Moscow, Volga River, upriver, Russia
French + actress: Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
  7. 7. Deep Learning helps Control http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#videos >75% human level
  8. 8. ‘Deep Learning’ buzz: ‘deep learning’ was reborn in 2006 and has recently become a buzzword with the explosion of 'big data', 'data science', and their applications mentioned in the media (trend comparison of the terms: big data, neural networks, data science, deep learning).
  9. 9. What is Deep Learning? • “Deep learning” has its roots in “neural networks” • Deep learning creates many layers (hence “deep”) of neurons, attempting to learn structured representations of big data, layer by layer. (Illustrations: a computerised artificial neuron and a biological neuron.) Le, Quoc V. "Building high-level features using large scale unsupervised learning." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
  10. 10. Brief History • The first wave – 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model – 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons – 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation. • The second wave – 1986: the back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again. • The third wave – 2006: deep (neural network) learning regained popularity, and – 2012: it made significant breakthroughs in many applications.
  11. 11. Summary • Neural networks basics – their backpropagation learning algorithm • Go deeper in layers • Convolutional neural networks (CNN) – address feature dependency • Recurrent neural networks (RNN) – address time dependency • Deep reinforcement learning – adds the ability to perform real-time control
  12. 12. Summary • Neural networks basics • Go deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  13. 13. Early Artificial Neurons (1943-1969) • A biological neuron • In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites • The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches • At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
  14. 14. Early Artificial Neurons (1943-1969) • McCulloch-Pitts neuron • McCulloch and Pitts [1943] proposed a simple model of a neuron as a computing machine • The artificial neuron computes a weighted sum of its inputs from other neurons, and outputs a one or a zero according to whether the sum is above or below a certain threshold: y = f(Σ_m w_m x_m − b), where f(x) = 1 if x ≥ 0, and 0 otherwise.
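As a quick illustration (not part of the original slides), here is a minimal Python sketch of a McCulloch-Pitts unit; the AND-gate weights and the 1.5 threshold are illustrative choices.

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """Return 1 if sum_m w_m * x_m - b >= 0, else 0."""
    return 1 if np.dot(w, x) - b >= 0 else 0

# Example: a 2-input AND gate with weights (1, 1) and threshold 1.5.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(np.array(x), np.array([1.0, 1.0]), 1.5))
```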
  15. 15. Early Artificial Neurons (1943-1969) • Rosenblatt's single-layer perceptron • Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning) • Focused on how to find appropriate weights w_m for a two-class classification task (y = 1: class one; y = −1: class two): y = φ(Σ_m w_m x_m − b), with activation function φ(x) = 1 if x ≥ 0, and −1 otherwise. The perceptron is built around a nonlinear neuron, namely the McCulloch-Pitts model: a linear combiner followed by a hard limiter (performing the signum function). The summing node computes a linear combination of the inputs applied to its synapses and incorporates an externally applied bias; the resulting sum (the induced local field) is applied to the hard limiter to produce the output y.
  16. 16. Early Artificial Neurons (1943-1969) • Rosenblatt's single-layer perceptron • Proved the convergence of a learning algorithm if the two classes are linearly separable (i.e., patterns lie on opposite sides of a hyperplane) • Many people hoped that such machines could be a basis for artificial intelligence. Model: y = φ(Σ_m w_m x_m + b), with φ(x) = 1 if x ≥ 0, and −1 otherwise. Learning rule, for t = 1, ..., T: y(t) = φ(Σ_m w_m(t) x_m(t) + b(t)), Δw_m(t) = η (d(t) − y(t)) x_m(t), w_m(t+1) = w_m(t) + Δw_m(t), where d(t) is the training label and η is a parameter (the learning rate). (Figure 1.1: signal-flow graph of the perceptron, with inputs x_1, ..., x_m, weights w_1, ..., w_m, bias b, hard limiter, and output y.) Haykin S. Neural Networks and Learning Machines. Upper Saddle River: Pearson Education, 2009.
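A small sketch of the perceptron learning rule above, assuming the bias is folded into the weight vector; the toy data, learning rate, and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron learning rule: w <- w + eta * (d - y) * x (bias folded into w)."""
    X = np.hstack([X, np.ones((len(X), 1))])     # append constant 1 for the bias term
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) >= 0 else -1   # phi(.) from the slide
            w += eta * (target - y) * x          # no change when y == target
    return w

# Linearly separable toy data: class +1 above the line x1 + x2 = 1, class -1 below.
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
d = np.array([-1, -1, 1, 1])
print(train_perceptron(X, d))
```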
  17. 17. Early Artificial Neurons (1943-1969) • The XOR problem • However, Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron • Rosenblatt believed the limitation could be overcome if more layers of units were added, but no learning algorithm to obtain those weights was known yet • Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years.
XOR truth table (inputs X1, X2; output X1 XOR X2): 0,0 → 0; 0,1 → 1; 1,0 → 1; 1,1 → 0.
XOR is non-linearly separable: the two classes (true and false) cannot be separated using a single line in the (X1, X2) plane.
  18. 18. Hidden layers and the backpropagation (1986~) • Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained by the "linearly separable" decision boundary x1·w1 + x2·w2 + b = 0 • Each hidden node realises one of the lines bounding a convex decision region.
  19. 19. Hidden layers and the backpropagation (1986~) • But the solution is quite often not unique: two different weight/threshold settings (solution 1 and solution 2) both implement XOR, since two lines are necessary to divide the sample space accordingly. (XOR truth table, inputs X1, X2 and output X1 XOR X2: 0,0 → 0; 0,1 → 1; 1,0 → 1; 1,1 → 0. The number in each circle is a threshold, and φ(·) is the sign function.) http://recognize-speech.com/basics/introduction-to-artificial-neural-networks http://www.cs.stir.ac.uk/research/publications/techreps/pdf/TR148.pdf
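To make the two-line intuition concrete, here is a hand-weighted two-hidden-unit threshold network that computes XOR; the particular weights and thresholds (0.5 and 1.5) are one possible solution, not necessarily the ones shown in the cited figures.

```python
def step(z):
    """Hard threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden unit 1 fires if at least one input is on (OR, threshold 0.5);
    # hidden unit 2 fires only if both inputs are on (AND, threshold 1.5).
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # The output fires for "OR but not AND", i.e. XOR.
    return step(h_or - h_and - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```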
  20. 20. Hidden layers and the backpropagation (1986~) • Two-layer feedforward neural network • Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the network. (The parameters are the weights between layers.)
  21. 21. Hidden layers and the backpropagation (1986~) • One of the efficient algorithms for training multi-layer neural networks is the backpropagation algorithm • It was re-introduced in 1986 and neural networks regained their popularity. Note: backpropagation appears to have first been developed by Werbos [1974], and was then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985]. (Diagram: error calculation at the output, followed by error backpropagation through the weight parameters.)
  22. 22. Example Application: Face Detection Figure 2. System for Neural Network based face detection in the compressed domain Rowley, Henry A., Shumeet Baluja, and Takeo Kanade. "Neural network-based face detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.1 (1998): 23-38. Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video[C]//Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
  23. 23. Example Application: Face Detection. Figure 2: system for neural-network-based face detection in the compressed domain. Eleftheriadis [13] tackles this problem by including additional face patterns arising from different block alignment positions as positive training examples. But this method induces too many variations, not necessarily belonging to inter-class variations, into the positive training samples. This in turn makes the high-frequency DCT coefficients unreliable for both face model training and face detection. Fortunately, there exist fast algorithms to calculate reconstructed DCT coefficients for overlapping blocks [16][17][18]. These methods help to calculate DCT coefficients for scan window steps of even up to one pixel; however, a good tradeoff between speed and accuracy is obtained by reconstructing the coefficients for scan window steps of two pixels. In order to detect faces of different sizes, compressed ... (Figure 3: training data preparation.) Suppose x1, x2, …, xn are the DCT coefficients retained from the feature extraction stage and x1(j), x2(j), …, xn(j), j = 1, 2, …, p, are the corresponding DCT coefficients retained from the training examples, where n is the number of DCT coefficient features retained (currently, 126 DCT coefficient features extracted from a 4x4-block-sized square window) and p is the number of training samples. The upper bounds (Ui) and lower bounds (Li) can be estimated by Ui = α·max{1, xi(1), …, xi(p)}, i = 1, 2, …, n (1). Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video. Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
  24. 24. How it works (training): training instances x1, x2, ..., xm (face and non-face image windows) are fed forward through the weight parameters to produce outputs y0, y1; the error is calculated against the labels (d1 = 1 for label = face, d2 = 0 for label = no face) and then backpropagated to update the weights.
  25. 25. Two-layer feedforward neural network. Feed-forward/prediction: given input x = (x1, ..., xm), the hidden activations are h_j^(1) = f1(net_j^(1)), where net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m, and the outputs are y_k = f2(net_k^(2)), where net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1). (Diagram: inputs x1, x2, ..., xm; first-layer weights w_{j,m}^(1); hidden units h_1^(1), h_2^(1), ..., h_j^(1) with activation f(1); second-layer weights w_{k,j}^(2); outputs y_1, ..., y_k with activation f(2); labels d_1, ..., d_k.)
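A minimal NumPy sketch of the feed-forward pass above, assuming sigmoid activations for both f1 and f2 and no bias terms; the shapes and toy random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Two-layer feed-forward pass:
    net_j^(1) = sum_m W1[j, m] * x[m],  h_j = f1(net_j^(1))
    net_k^(2) = sum_j W2[k, j] * h[j],  y_k = f2(net_k^(2))
    """
    h = sigmoid(W1 @ x)   # hidden activations (f1 = sigmoid here)
    y = sigmoid(W2 @ h)   # outputs (f2 = sigmoid here)
    return h, y

rng = np.random.default_rng(0)
x = rng.random(4)                      # m = 4 inputs
W1 = rng.standard_normal((3, 4))       # 3 hidden units
W2 = rng.standard_normal((2, 3))       # 2 output units
h, y = forward(x, W1, W2)
print(h, y)
```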
  28. 28. Backprop to learn the parameters (output layer), for the two-layer feedforward neural network. Error function: E(W) = (1/2) Σ_k (y_k − d_k)^2. Notation: net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m, net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1). Output neuron update:
Δw_{k,j}^(2) ≡ −η ∂E(W)/∂w_{k,j}^(2) = −η (y_k − d_k) (∂y_k/∂net_k^(2)) (∂net_k^(2)/∂w_{k,j}^(2)) = η (d_k − y_k) f(2)'(net_k^(2)) h_j^(1) = η δ_k h_j^(1),
where the error term is δ_k ≡ (d_k − y_k) f(2)'(net_k^(2)), so that Δw_{k,j}^(2) = η · Error_k · Output_j = η δ_k h_j^(1) and w_{k,j}^(2) ← w_{k,j}^(2) + Δw_{k,j}^(2). η is the given learning rate.
  30. 30. Backprop to learn the parameters (hidden layer), for the two-layer feedforward neural network, with the same error function E(W) = (1/2) Σ_k (y_k − d_k)^2 and notation net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m, net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1). Hidden neuron update:
Δw_{j,m}^(1) = −η ∂E(W)/∂w_{j,m}^(1) = −η (∂E(W)/∂h_j^(1)) (∂h_j^(1)/∂w_{j,m}^(1)) = η [Σ_k (d_k − y_k) f(2)'(net_k^(2)) w_{k,j}^(2)] [x_m f(1)'(net_j^(1))] = η δ_j x_m,
where the hidden error term is δ_j ≡ f(1)'(net_j^(1)) Σ_k (d_k − y_k) f(2)'(net_k^(2)) w_{k,j}^(2) = f(1)'(net_j^(1)) Σ_k δ_k w_{k,j}^(2), i.e. the output errors δ_k are propagated back through the weights w_{k,j}^(2). Hence Δw_{j,m}^(1) = η · Error_j · Output_m = η δ_j x_m and w_{j,m}^(1) ← w_{j,m}^(1) + Δw_{j,m}^(1). η is the given learning rate.
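Putting the output-layer and hidden-layer updates together, here is a sketch of one backpropagation step for the two-layer network, again assuming sigmoid activations and no biases; the function name `backprop_step`, the learning rate, and the toy data are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, eta=0.5):
    """One gradient step on E = 1/2 * sum_k (y_k - d_k)^2 for a two-layer sigmoid net."""
    # Forward pass
    h = sigmoid(W1 @ x)                       # hidden activations h_j^(1)
    y = sigmoid(W2 @ h)                       # outputs y_k
    # Output-layer error: delta_k = (d_k - y_k) * f2'(net_k), and f' = y(1 - y) for sigmoid
    delta_k = (d - y) * y * (1.0 - y)
    # Hidden-layer error: delta_j = f1'(net_j) * sum_k delta_k * w_kj
    delta_j = h * (1.0 - h) * (W2.T @ delta_k)
    # Weight updates: Delta w = eta * (error term) * (input feeding that weight)
    W2 += eta * np.outer(delta_k, h)
    W1 += eta * np.outer(delta_j, x)
    return W1, W2, y

rng = np.random.default_rng(0)
x, d = rng.random(3), np.array([1.0, 0.0])    # one training instance with a one-hot target
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
for _ in range(100):                          # repeated steps shrink the squared error
    W1, W2, y = backprop_step(x, d, W1, W2)
print(y)
```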
  32. 32. An example for backprop. Figure 3.4: all the calculations for a reverse pass of backpropagation, for a network with inputs λ and Ω, hidden units A, B, C, and outputs α and β (weights W_λA, W_ΩA, ..., W_Aα, W_Aβ, ..., W_Cβ). Step 1: calculate the errors of the output neurons, δ_α = out_α (1 − out_α)(Target_α − out_α) and δ_β = out_β (1 − out_β)(Target_β − out_β). https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
  33. 33. An example for backprop, assuming a sigmoid activation function: f_Sigmoid(x) = 1/(1 + e^(−x)), f'_Sigmoid(x) = f_Sigmoid(x)(1 − f_Sigmoid(x)).
1. Calculate errors of output neurons: δ_α = out_α (1 − out_α)(Target_α − out_α); δ_β = out_β (1 − out_β)(Target_β − out_β)
2. Change output layer weights: W+_Aα = W_Aα + η δ_α out_A; W+_Aβ = W_Aβ + η δ_β out_A; W+_Bα = W_Bα + η δ_α out_B; W+_Bβ = W_Bβ + η δ_β out_B; W+_Cα = W_Cα + η δ_α out_C; W+_Cβ = W_Cβ + η δ_β out_C
3. Calculate (back-propagate) hidden layer errors: δ_A = out_A (1 − out_A)(δ_α W_Aα + δ_β W_Aβ); δ_B = out_B (1 − out_B)(δ_α W_Bα + δ_β W_Bβ); δ_C = out_C (1 − out_C)(δ_α W_Cα + δ_β W_Cβ)
4. Change hidden layer weights: W+_λA = W_λA + η δ_A in_λ; W+_ΩA = W_ΩA + η δ_A in_Ω; W+_λB = W_λB + η δ_B in_λ; W+_ΩB = W_ΩB + η δ_B in_Ω; W+_λC = W_λC + η δ_C in_λ; W+_ΩC = W_ΩC + η δ_C in_Ω
The constant η (the learning rate, nominally equal to one) can be adjusted to speed up or slow down the learning if required.
This matches the general formulas: δ_k = (d_k − y_k) f(2)'(net_k^(2)), Δw_{k,j}^(2) = η · Error_k · Output_j = η δ_k h_j^(1); δ_j = f(1)'(net_j^(1)) Σ_k δ_k w_{k,j}^(2), Δw_{j,m}^(1) = η · Error_j · Output_m = η δ_j x_m.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
  34. 34. Let us do some calculation. Worked example 3.1: consider the simple network below, with inputs A = 0.35 and B = 0.9, hidden-layer weights 0.1 and 0.8 (into the top hidden neuron), 0.4 and 0.6 (into the bottom hidden neuron), and output-layer weights 0.3 and 0.9. Assume that the neurons have a sigmoid activation function and (i) perform a forward pass on the network; (ii) perform a reverse pass (training) once (target = 0.5); (iii) perform a further forward pass and comment on the result. https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
  35. 35. Let us do some calculation (answer).
(i) Forward pass: input to top hidden neuron = (0.35x0.1)+(0.9x0.8) = 0.755, out = 0.68; input to bottom hidden neuron = (0.9x0.6)+(0.35x0.4) = 0.68, out = 0.6637; input to final neuron = (0.3x0.68)+(0.9x0.6637) = 0.80133, out = 0.69.
(ii) Reverse pass: output error δ = (t − o)(1 − o)o = (0.5 − 0.69)(1 − 0.69)(0.69) = −0.0406. New output-layer weights: w1+ = w1 + (δ x input) = 0.3 + (−0.0406x0.68) = 0.272392; w2+ = w2 + (δ x input) = 0.9 + (−0.0406x0.6637) = 0.87305. Errors for hidden layers: δ1 = δ x w1 x (1 − o)o = −0.0406 x 0.272392 x (1 − o)o = −2.406x10^−3; δ2 = δ x w2 x (1 − o)o = −0.0406 x 0.87305 x (1 − o)o = −7.916x10^−3. New hidden layer weights: w3+ = 0.1 + (−2.406x10^−3 x 0.35) = 0.09916; w4+ = 0.8 + (−2.406x10^−3 x 0.9) = 0.7978; w5+ = 0.4 + (−7.916x10^−3 x 0.35) = 0.3972; w6+ = 0.6 + (−7.916x10^−3 x 0.9) = 0.5928.
(iii) The old error was −0.19; the new error is −0.18205, so the error has been reduced.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
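The forward pass in part (i) can be checked with a few lines of Python; this is only a sketch that reproduces the forward-pass numbers quoted above, not the reverse pass.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

A, B, target = 0.35, 0.9, 0.5
# Forward pass with the worked-example weights
net_top = 0.1 * A + 0.8 * B            # 0.755
net_bottom = 0.4 * A + 0.6 * B         # 0.68
out_top, out_bottom = sigmoid(net_top), sigmoid(net_bottom)   # ~0.680, ~0.6637
net_final = 0.3 * out_top + 0.9 * out_bottom                  # ~0.80133
out = sigmoid(net_final)                                      # ~0.69
print(out_top, out_bottom, out, target - out)                 # error ~ -0.19
```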
  36. 36. Activation functions (plots of f_Sigmoid(x), f_linear(x), f_tanh(x) and their derivatives f'_Sigmoid(x), f'_linear(x), f'_tanh(x)). https://theclevermachine.wordpress.com/tag/tanh-function/
  37. 37. Activation functions • Logistic sigmoid: f_Sigmoid(x) = 1/(1 + e^(−x)), with derivative f'_Sigmoid(x) = f_Sigmoid(x)(1 − f_Sigmoid(x)) • Output range [0, 1] • Motivated by biological neurons and can be interpreted as the probability of an artificial neuron "firing" given its inputs • However, saturated neurons make gradients vanish (why?)
  38. 38. Activation functions • Tanh function: f_tanh(x) = sinh(x)/cosh(x) = (e^x − e^(−x))/(e^x + e^(−x)), with gradient f'_tanh(x) = 1 − f_tanh(x)^2 • Output range [−1, 1] • Thus strongly negative inputs to the tanh will map to negative outputs • Only zero-valued inputs are mapped to near-zero outputs • These properties make the network less likely to get "stuck" during training • Squashes numbers to the range [−1, 1]; zero-centered (nice); but still kills gradients when saturated [LeCun et al., 199…]. https://theclevermachine.wordpress.com/tag/tanh-function/
  39. 39. Activation functions • ReLU (rectified linear unit): f_ReLU(x) = max(0, x) • Another version is the noisy ReLU: f_noisyReLU(x) = max(0, x + N(0, σ(x))) • ReLU can be approximated by the softplus function f_softplus(x) = log(1 + e^x) • The ReLU gradient doesn't vanish as we increase x • It can be used to model positive numbers • It is fast, as there is no need to compute the exponential function • It eliminates the necessity of a "pretraining" phase • The derivative: for x > 0, f' = 1; for x < 0, f' = 0. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf
  40. 40. Activation functions • ReLU (rectified linear unit): f_ReLU(x) = max(0, x); ReLU can be approximated by the softplus function f_softplus(x) = log(1 + e^x) • The only non-linearity comes from the path selection, with individual neurons being active or not • It allows sparse representations: for a given input, only a subset of neurons are active (sparse propagation of activations and gradients) • Additional activation functions: Leaky ReLU, Exponential LU, Maxout, etc. http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
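The activation functions from the last few slides, with their derivatives, sketched in NumPy; taking the ReLU derivative to be 0 at x = 0 is a convention, and the example input grid is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # f'(x) = f(x)(1 - f(x))

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # f'(x) = 1 - f(x)^2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # derivative taken as 0 at x = 0

def softplus(x):
    return np.log1p(np.exp(x))    # smooth approximation to ReLU

x = np.linspace(-5, 5, 11)
print(relu(x), d_relu(x), softplus(x))
```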
  41. 41. Error/Loss function • Recall stochastic gradient descent – update from a randomly picked example (but in practice do a batch update): w ← w − η ∂E(w)/∂w • Squared error loss for one binary output: E(w) = (1/2)(d_label − f(x; w))^2, where x is the input and f(x; w) is the network output.
  42. 42. Error/Loss function • Softmax (cross-entropy loss) for multiple classes, with one-hot encoded class labels d_k (class labels follow a multinomial distribution): E(w) = −Σ_k d_k log y_k, where y_k = exp(Σ_j w_{k,j}^(2) h_j^(1)) / Σ_{k'} exp(Σ_j w_{k',j}^(2) h_j^(1)).
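A small NumPy sketch of the softmax output and its cross-entropy loss for a one-hot label; the max-shift inside the softmax is a standard numerical-stability trick, and the example numbers are arbitrary.

```python
import numpy as np

def softmax(net):
    """y_k = exp(net_k) / sum_k' exp(net_k'), shifted by max(net) for numerical stability."""
    e = np.exp(net - net.max())
    return e / e.sum()

def cross_entropy(y, d):
    """E = -sum_k d_k log y_k for a one-hot label vector d."""
    return -np.sum(d * np.log(y + 1e-12))

net = np.array([2.0, 1.0, 0.1])   # output pre-activations net_k^(2)
d = np.array([1.0, 0.0, 0.0])     # one-hot class label
y = softmax(net)
print(y, cross_entropy(y, d))
# A convenient fact for backprop: the gradient of E w.r.t. net_k is simply y_k - d_k.
```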
  43. 43. Backpropagation: conclusion • Advantage: a multi-layer perceptron can be trained by the backpropagation algorithm to perform any mapping between the input and the output (Hornik, Kurt. "Approximation capabilities of multilayer feedforward networks." Neural Networks 4.2 (1991): 251-257.) • What is wrong with backpropagation? – It requires labeled training data, and most data is unlabeled – The learning time does not scale well: it is very slow in networks with multiple hidden layers – It can get stuck in poor local optima. Slide based on Geoffrey Hinton's. http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html
  44. 44. Summary • Neural networks basic and its backpropagation learning algorithm • Go deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  45. 45. Why more than one hidden layer? http://www.rsipvision.com/exploring-deep-learning/
  46. 46. Why go deep now? • Big data everywhere • Demands for: – Unsupervised learning: most data is unlabeled – End-to-end, trainable feature solutions: hand-crafted features are time-consuming • Improved computational power: – Widely used GPUs for computing – Distributed computing (MapReduce & Spark)
  47. 47. Deep learning problems • Difficult to train – many, many parameters • Two solutions: – Regularise by prior knowledge (since 1989): • Convolutional Neural Networks (CNN) – Pre-train using unsupervised methods (since 2006): • Stacked auto-encoders • Stacked Restricted Boltzmann Machines
  48. 48. New methods for unsupervised pre-training (2006) • Stacked RBMs (Deep Belief Networks, DBNs) – Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554. – Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006. • (Stacked) Autoencoders – Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19. (Diagrams: stacked Restricted Boltzmann Machines; stacked autoencoders.)
  49. 49. Restricted Boltzmann Machines (RBM) • An RBM is a bipartite (two disjoint sets) undirected graph, used as a probabilistic generative model – it contains m visible units V = (V1, ..., Vm) to represent observable data and – n hidden units H = (H1, ..., Hn) to capture their dependencies • Assume the random variables (V, H) take values (v, h) ∈ {0, 1}^(m+n).
• Joint distribution (Gibbs distribution): p(v, h) = (1/Z) e^(−E(v,h)), with energy function E(v, h) = −Σ_{i=1..n} Σ_{j=1..m} w_ij h_i v_j − Σ_{j=1..m} b_j v_j − Σ_{i=1..n} c_i h_i and partition function Z = Σ_{v,h} e^(−E(v,h)); w_ij is a real-valued weight on the edge between units V_j and H_i, and b_j and c_i are bias terms for the j-th visible and the i-th hidden variable, respectively.
• The probability the network assigns to a visible vector is p(v) = (1/Z) Σ_h e^(−E(v,h)).
• No connections within a layer (restricted) leads to simple conditionals: p(H_i = 1 | v) = sig(Σ_j w_ij v_j + c_i) and p(V_j = 1 | h) = sig(Σ_i w_ij h_i + b_j), where sig(x) = 1/(1 + e^(−x)) is the sigmoid; the RBM can thus be interpreted as a stochastic neural network.
• We learn the parameters w_ij by maximising the log-likelihood of a training example v: ∂log p(v)/∂w_ij = ⟨h_i v_j⟩_data − ⟨h_i v_j⟩_model, giving the learning rule Δw_ij = ε(⟨h_i v_j⟩_data − ⟨h_i v_j⟩_model), where ε is a learning rate, ⟨·⟩_data is the expectation under the data empirical distribution and ⟨·⟩_model is the expectation under the model.
http://image.diku.dk/igel/paper/AItRBM-proof.pdf
Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
  52. 52. Restricted Boltzmann Machines (RBM), continued • The model expectation ⟨h_i v_j⟩_model in the learning rule Δw_ij = ε(⟨h_i v_j⟩_data − ⟨h_i v_j⟩_model) is hard to compute exactly; it can be approximated by Contrastive Divergence. Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
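A minimal sketch of one Contrastive Divergence (CD-1) update for a binary RBM, assuming W has shape (n_hidden x n_visible); the function name `cd1_update`, the learning rate, and the toy data are illustrative, not taken from the cited guide.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, eta=0.1):
    """One CD-1 update for a binary RBM with weights W (n_hidden x n_visible),
    visible biases b and hidden biases c."""
    # Positive phase: p(h = 1 | v0) and a binary sample of h
    ph0 = sigmoid(W @ v0 + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct v, then recompute p(h = 1 | v1)
    pv1 = sigmoid(W.T @ h0 + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + c)
    # Approximate gradient: <h v>_data - <h v>_model (model term from the reconstruction)
    W += eta * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)
    return W, b, c

v = (rng.random(6) > 0.5).astype(float)   # a toy binary visible vector
W = 0.1 * rng.standard_normal((3, 6))
b, c = np.zeros(6), np.zeros(3)
for _ in range(100):
    W, b, c = cd1_update(v, W, b, c)
```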
  53. 53. Stacked RBM • The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack. • After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
  54. 54. Stacked RBM • The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack. • After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives. • Unsupervised learning results (document retrieval, averaged over all 402,207 possible queries): the fraction of retrieved documents in the same class as the query, when a query document from the test set is used to retrieve other test set documents; the 2-D codes produced by two-dimensional LSA (Latent Semantic Analysis) are compared with the codes produced by a 2000-500-250-125-2 stacked-RBM autoencoder. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
  55. 55. Auto-encoder • A multi-layer neural network whose target output is its input: – Reconstruction x_o = decoder(encoder(input x_i)) – Minimizing the reconstruction error (x_i − x_o)^2 • Encoder: h = f(W_e^T x_i + b_e); decoder: x_o = f(W_d^T h + b_d), with f a sigmoid; a common setting is tied weights, W_e^T = W_d • The goal is to learn W_e, b_e, b_d such that the average reconstruction error is minimised • If f is chosen to be linear, the auto-encoder is equivalent to PCA; if non-linear, it captures multi-modal aspects of the input distribution. http://deeplearning.net/tutorial/dA.html
  56. 56. Denoising auto-encoder • A stochastic version of the auto-encoder • We train the autoencoder to reconstruct the input from a corrupted version of it, so that we – force the hidden layer to discover more robust features and – prevent it from simply learning the identity • The stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero • The goal is to predict the missing values from the uncorrupted (i.e., non-missing) values. (Diagram: corrupted x_i → encoder → y → decoder → x_o.) http://deeplearning.net/tutorial/dA.html
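A sketch of one training step of a tied-weight denoising autoencoder with squared reconstruction error, along the lines described above; the corruption level, learning rate, and function name `dae_step` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W, b_e, b_d, eta=0.1, corruption=0.5):
    """One gradient step of a tied-weight denoising autoencoder, reconstructing
    the *clean* input x from a corrupted copy under E = 1/2 ||x_hat - x||^2."""
    x_tilde = x * (rng.random(x.shape) > corruption)   # randomly zero ~half the inputs
    h = sigmoid(W @ x_tilde + b_e)                     # encoder
    x_hat = sigmoid(W.T @ h + b_d)                     # decoder (tied weights W_d = W_e^T)
    # Backprop through the two sigmoids
    delta_out = (x_hat - x) * x_hat * (1.0 - x_hat)
    delta_hid = (W @ delta_out) * h * (1.0 - h)
    W -= eta * (np.outer(h, delta_out) + np.outer(delta_hid, x_tilde))
    b_d -= eta * delta_out
    b_e -= eta * delta_hid
    return W, b_e, b_d

x = rng.random(6)                          # a toy 6-dimensional input in [0, 1]
W = 0.1 * rng.standard_normal((3, 6))
b_e, b_d = np.zeros(3), np.zeros(6)
for _ in range(200):
    W, b_e, b_d = dae_step(x, W, b_e, b_d)
```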
  57. 57. Stacked denoising auto-encoder • Pretraining by stacking auto-encoders (figure from Bengio, ICML 2009); the parameter matrix W' is the transpose of W • Unsupervised greedy layer-wise training procedure: (1) pre-train the first layer; (2) pre-train the second layer; (3) fine-tune the whole network using the labels. Bengio Y, Lamblin P, Popovici D, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2007, 19: 153.
  58. 58. Convolutional neural networks: receptive field • Receptive field: neurons in the retina respond to light stimuli in restricted regions of the visual field • Animal experiments (Hubel and Wiesel's studies of the visual system of the cat) on the receptive fields of two retinal ganglion cells: – Fields are circular areas of the retina – The "on"-center cell responds when the center is illuminated and the surround is darkened – The "off"-center cell responds when the center is darkened and the surround is illuminated – Both cells give on- and off-responses when both center and surround are illuminated, but neither response is as strong as when only the center or the surround is illuminated. Hubel D.H.: The Visual Cortex of the Brain. Sci Amer 209:54-62, 1963.
  59. 59. Convolutional neural networks • Sparse connectivity by local correlation – Filter: the inputs of a hidden unit in layer m come from a subset of units in layer m−1 that have spatially contiguous receptive fields • Shared weights – each filter is replicated across the entire visual field; these replicated units share the same weights and form a feature map • Using locality structures significantly reduces the number of connections, and a further reduction is achieved via "weight sharing", in which some of the parameters in the model are constrained to be equal to each other. For example, in the 1-D network shown in the figure, the constraints are w1 = w4 = w7, w2 = w5 = w8 and w3 = w6 = w9 (edges that have the same color have the same weight). Weight sharing gives translational invariance with convolutional neural networks. http://deeplearning.net/tutorial/lenet.html
  60. 60. LeNet-5 (architecture: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)) • LeNet-5 consists of 7 layers, all of which contain trainable weights • End-to-end solution: the learned layers replace hand-crafted feature extraction • Cx: convolution layers with 5x5 filters • Sx: subsampling (pooling) layers • Fx: fully-connected layers. LeCun Y, Boser B, Denker J S, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551.
  61. 61. LeNet-5: convolution layer • Example: a 10x10 input image convolved with a 3x3 filter results in an 8x8 feature map; an activation function f is applied to the filter responses. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
  62. 62. LeNet 5: Convolution Layer • Example: convolving a 10x10 input image with a 3x3 filter results in an 8x8 feature map. • Three different filters (with different weights) produce three 8x8 output feature maps. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
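A rough numpy sketch of the convolution step in the two slides above, assuming a "valid" convolution with stride 1 and a tanh activation (the random image and filters are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """'Valid' 2-d convolution/cross-correlation: no padding, stride 1."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = rng.random((10, 10))            # a 10x10 input image
filters = rng.normal(size=(3, 3, 3))    # 3 different 3x3 filters

feature_maps = [np.tanh(conv2d_valid(image, f)) for f in filters]
print([fm.shape for fm in feature_maps])   # three (8, 8) feature maps
```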
  63. 63. LeNet 5: (Subsampling) Pooling Layer • Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum or average value (max pooling: the maximum within a 2x2 window; average pooling: the average within a 2x2 window). • Max pooling – reduces computation, – keeps the most responsive node of the given region of interest, – but may lose precise spatial information. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
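A small numpy sketch of the non-overlapping 2x2 pooling described above (the feature map is a placeholder; `mode` selects max or average pooling):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size sub-regions."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % size, :W - W % size]   # crop to a multiple of size
    blocks = fm.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fm = np.arange(64, dtype=float).reshape(8, 8)   # an 8x8 feature map
print(pool2d(fm, 2, "max").shape)               # (4, 4)
print(pool2d(fm, 2, "avg").shape)               # (4, 4)
```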
  64. 64. LeNet 5 results • MNIST (handwritten digits) dataset: – 60k training and 10k test examples • Test error rate: 0.95% (figures: training and test error versus training set size, with and without distortions; the 82 test patterns misclassified by LeNet-5, shown with the correct answer on the left and the machine's answer on the right) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998. http://yann.lecun.com/exdb/mnist/
  65. 65. From MNIST to ImageNet • ImageNet – over 15M labeled high-resolution images – roughly 22K categories – collected from the web and labeled via Amazon Mechanical Turk • The image/scene classification challenge (ILSVRC) – metric: Hit@5 error rate – the model makes 5 guesses about the image label, and an image counts as correct if the true label is among them. http://cognitiveseo.com/blog/6511/will-google-read-rank-images-near-future/ Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
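A short numpy sketch of the Hit@5 (top-5) error metric described above, using hypothetical scores and labels:

```python
import numpy as np

def hit_at_5_error(scores, true_labels):
    """scores: (N, num_classes) model scores; true_labels: (N,) ground-truth ids.
    An image is correct if its true label is among the 5 highest-scoring classes."""
    top5 = np.argsort(-scores, axis=1)[:, :5]
    hits = np.any(top5 == true_labels[:, None], axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((4, 1000))          # 4 hypothetical images, 1000 classes
labels = np.array([3, 999, 42, 7])
print(hit_at_5_error(scores, labels))
```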
  66. 66. Leader table (ImageNet image classification) • Selected Hit@5 classification error rates: – 2015 ResNet (ILSVRC'15): 3.57% – Microsoft ResNet, a 152-layer network – 2014 GoogLeNet: 6.66% – a 22-layer network – 2014 VGG: 7.32% – 2014 MSRA: 8.06% – 2013 Clarifai: 11.74% – 2012 SuperVision: 16.42% – U. of Toronto, a 7-layer network • Unofficial human error is around 5.1% on a subset. • Why is there still human error? When labeling, human raters only judged whether an image belongs to a given class (binary classification); the challenge, however, is a 1000-class classification problem. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  67. 67. SuperVision • Architecture rooted in LeNet-5; the network is split across two GPUs that communicate only at certain layers (the input is 150,528-dimensional, and the remaining layers contain 253,440–186,624–64,896–64,896–43,264–4096–4096–1000 neurons). • Rectified Linear Units (ReLU), overlapping pooling, and the dropout trick. • Randomly extracted 224x224 patches from the 256x256 images to obtain more training data. Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012: 1097-1105.
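A rough sketch of the random 224x224 crop (plus horizontal flip) augmentation mentioned in the last bullet, assuming an H x W x 3 image array; this is not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, crop=224):
    """Randomly extract a crop x crop patch from an H x W x 3 image."""
    H, W, _ = image.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:              # random horizontal flip
        patch = patch[:, ::-1]
    return patch

image = rng.random((256, 256, 3))       # one 256x256 training image
print(random_crop(image).shape)         # (224, 224, 3)
```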
  68. 68. Dropout • Dropout randomly 'drops' units from a layer on each training step, creating 'sub-architectures' within the model. • It can be viewed as sampling a smaller network within a larger network. • It helps prevent neural networks from overfitting. Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
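A minimal numpy sketch of dropout at training time, using the common "inverted" variant (rescaling the surviving units during training, equivalent to the paper's test-time rescaling); the activation array and drop probability are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    """'Inverted' dropout: drop units with probability p_drop at training time
    and rescale the survivors, so nothing needs to change at test time."""
    if not train:
        return activations
    mask = (rng.random(activations.shape) > p_drop).astype(float)
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 8))        # a hypothetical hidden-layer activation
print(dropout(h))          # roughly half the units are zeroed on this step
```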
  69. 69. Go deeper with CNN • GoogLeNet, a 22-layer deep network. • Increasing the size of a NN brings two difficulties: 1) overfitting and 2) high computation cost. • Solution: a sparse deep neural network, built from modules of parallel 1x1, 3x3 and 5x5 convolutions plus 3x3 max pooling, whose filter banks are depth-concatenated into a single output vector for the next layer. (figure: the full GoogLeNet network "with all the bells and whistles", including auxiliary softmax outputs partway through the network) https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet
  70. 70. ResNet: Deep Residual Learning • Overfitting prevents a neural network from going deeper. • A thought experiment: – take a shallower architecture (e.g., logistic regression); – add more layers onto it to construct its deeper counterpart; – the deeper model should then produce no higher training error than its shallower counterpart, if the added layers are identity mappings and the other layers are copied from the learned shallower model. • However, it is reported that adding more layers to a very deep model leads to higher training error. (figure from He et al.: training and test error on CIFAR-10 for 20-layer and 56-layer "plain" networks; the deeper network has higher training error, and thus higher test error) He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015). https://github.com/KaimingHe/deep-residual-networks
  71. ResNet: Deep Residual Learning • Solution: feedforward neural networks with "shortcut connections" (figure: the residual learning building block, y = F(x) + x, with two weight layers and ReLUs inside F). 1. Denote the desired underlying mapping as H(x). 2. Instead of fitting H(x), a stack of nonlinear layers fits another mapping F(x) := H(x) − x. 3. The original mapping is recast into F(x) + x. 4. During training, it is easy to push the residual F(x) to zero (creating an identity mapping) with a stack of nonlinear layers. • This leads to an incredible 152-layer deep convolutional network, driving the error down to 3.57% in the ImageNet classification task! (figures from He et al.: a 34-layer plain network vs. a 34-layer residual network, with VGG-19 as a reference; training on CIFAR-10, where dashed lines denote training error and bold lines test error, comparing plain networks and ResNets of depth 20 to 110) He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015). https://github.com/KaimingHe/deep-residual-networks
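A toy numpy sketch of a residual block y = F(x) + x, with fully-connected layers standing in for the paper's convolution and batch-norm stacks; dimensions and weight scales are made up to show that tiny weights reduce the block to a near-identity mapping:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is a small stack of nonlinear layers.
    If W1 and W2 are near zero, the block reduces to the identity mapping."""
    F = relu(x @ W1) @ W2      # the residual function F(x)
    return relu(F + x)         # the shortcut connection adds the input back

d = 16                         # hypothetical feature dimension
x = rng.normal(size=(1, d))
W1 = rng.normal(scale=0.01, size=(d, d))
W2 = rng.normal(scale=0.01, size=(d, d))
# With tiny weights the output is close to relu(x), i.e. near-identity.
print(np.allclose(residual_block(x, W1, W2), relu(x), atol=1e-1))
```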
  72. 72. Summary • Neural network basics and the backpropagation learning algorithm • Going deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  73. 73. Recurrent neural networks • Why we need another NN model: – sequential prediction – time series • A simple recurrent neural network (RNN); training can be done by backpropagation through time (BPTT). • A two-layer feedforward network computes s = f(xU), o = f(sV), where x is the input vector, o the output vector, s the hidden state vector, U the layer-1 parameter matrix, V the layer-2 parameter matrix, and f is tanh or ReLU. • An RNN adds time-dependency to the hidden state s via a state-transition parameter matrix W: s_{t+1} = f(x_{t+1} U + s_t W), o_{t+1} = f(s_{t+1} V). http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
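A minimal numpy sketch of the unrolled forward pass s_{t+1} = f(x_{t+1} U + s_t W), o_{t+1} = f(s_{t+1} V), with made-up sizes and tanh as f:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dim inputs, 8-dim hidden state, 3-dim outputs.
n_in, n_hid, n_out = 4, 8, 3
U = rng.normal(scale=0.1, size=(n_in, n_hid))    # input -> hidden
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden -> hidden (state transition)
V = rng.normal(scale=0.1, size=(n_hid, n_out))   # hidden -> output

def rnn_forward(xs):
    """Unroll the recurrence over a sequence of input vectors xs."""
    s = np.zeros(n_hid)
    outputs = []
    for x in xs:
        s = np.tanh(x @ U + s @ W)
        outputs.append(np.tanh(s @ V))
    return np.array(outputs), s

xs = rng.random((5, n_in))           # a length-5 input sequence
os, s_final = rnn_forward(xs)
print(os.shape, s_final.shape)       # (5, 3) (8,)
```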
  74. 74. LSTMs (Long Short-Term Memory networks) • A standard recurrent network (SRN) has trouble learning long-term dependencies (due to vanishing gradients of saturated nodes); its cell simply computes s_t = tanh(x_t U + s_{t-1} W). • An LSTM provides another way to compute the hidden state: an input gate i, a forget gate f and an output gate o (each a sigmoid σ, giving a control signal between 0 and 1) combine a "candidate" hidden state with the cell's internal memory c_t through element-wise multiplication, and the hidden state s_t is read out from that memory. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780. http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ http://colah.github.io/posts/2015-08-Understanding-LSTMs/
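A compact numpy sketch of one LSTM step with the standard gate equations (biases omitted for brevity; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                       # hypothetical sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (input U, recurrent W) weight pair per gate: input, forget, output, candidate.
params = {g: (rng.normal(scale=0.1, size=(n_in, n_hid)),
              rng.normal(scale=0.1, size=(n_hid, n_hid))) for g in "ifog"}

def lstm_step(x_t, s_prev, c_prev):
    """One LSTM update of the hidden state s and cell internal memory c."""
    i = sigmoid(x_t @ params["i"][0] + s_prev @ params["i"][1])   # input gate
    f = sigmoid(x_t @ params["f"][0] + s_prev @ params["f"][1])   # forget gate
    o = sigmoid(x_t @ params["o"][0] + s_prev @ params["o"][1])   # output gate
    g = np.tanh(x_t @ params["g"][0] + s_prev @ params["g"][1])   # candidate state
    c = f * c_prev + i * g            # cell internal memory
    s = o * np.tanh(c)                # hidden state
    return s, c

s, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.random((5, n_in)):         # a length-5 sequence
    s, c = lstm_step(x_t, s, c)
print(s.shape, c.shape)
```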
  75. 75. RNN applications: time series • A ClockWork-RNN (CW-RNN) is similar to a simple RNN with an input, output and hidden layer. • The difference lies in the hidden layer: – it is partitioned into g modules, each with its own clock rate; – within each module the neurons are fully interconnected; – neurons in a faster module i are connected to neurons in a slower module j only if clock period T_i < T_j. Koutnik, Jan, et al. "A clockwork RNN." arXiv preprint arXiv:1402.3511 (2014).
  76. 76. Word embedding • From bag of words to word embedding: use a real-valued vector in R^m to represent a word (concept), e.g. v("cat") = (0.2, -0.4, 0.7, ...), v("mat") = (0.0, 0.6, -0.1, ...). • Continuous bag of words (CBOW) model (word2vec): – input/output words x/y are one-hot encoded; – the hidden layer is shared for all input words. • Notation: V is the vocabulary size; C the number of context (input) words; v_w a row of the input matrix W (the N-dimensional vector representation of word w); v'_w a row of the output matrix W'. • Hidden layer (the average of the context words' input vectors): h = (1/C) W (x_1 + x_2 + ... + x_C) = (1/C)(v_w1 + v_w2 + ... + v_wC). • Cross-entropy loss for target word w_O given context words w_I,1, ..., w_I,C: E = -log p(w_O | w_I,1, ..., w_I,C) = -v'_wO · h + log Σ_{j=1..V} exp(v'_wj · h). • Gradient updates: v'_wj ← v'_wj − η e_j h for every output word j (with prediction error e_j = y_j − t_j), and v_wI,c ← v_wI,c − (1/C) η EH for every context word, where EH_i = Σ_j e_j w'_ij. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
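A toy numpy sketch of the CBOW forward pass and loss above (vocabulary size, embedding dimension and word indices are made up; real word2vec additionally uses tricks such as hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 5, 2                           # toy vocabulary size, embedding dim, context size

W_in = rng.normal(scale=0.1, size=(V, N))    # input vectors v_w   (rows of W)
W_out = rng.normal(scale=0.1, size=(V, N))   # output vectors v'_w (rows of W')

def cbow_loss(context_ids, target_id):
    """CBOW forward pass: average the context word vectors, score every word,
    and return the cross-entropy loss -log p(w_O | context)."""
    h = W_in[context_ids].mean(axis=0)                     # h = (1/C) * sum of context vectors
    scores = W_out @ h                                     # u_j = v'_wj . h for every word j
    log_probs = scores - np.log(np.sum(np.exp(scores)))    # log softmax
    return -log_probs[target_id]

print(cbow_loss(context_ids=[3, 7], target_id=5))
```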
  77. 77. Remarkable properties of word embeddings • Simple algebraic operations with the word vectors: v("woman") − v("man") ≃ v("aunt") − v("uncle"), v("woman") − v("man") ≃ v("queen") − v("king"). (figure from Mikolov et al.: vector offsets for word pairs illustrating the gender relation, and a different projection showing the singular/plural relation; in high-dimensional space, multiple relations can be embedded for a single word) • A word relationship is defined by subtracting two word vectors and adding the result to another word; for example, Paris − France + Italy ≈ Rome. • Using X = v("biggest") − v("big") + v("small") as a query and searching for the nearest word by cosine distance returns v("smallest"). • Bilingual training places semantically related English and Chinese words near each other in the embedding space (figure from Zou et al.: overlaid bilingual embeddings, with English words in yellow boxes, Chinese words in green, and reference translations provided). Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL, 2013. Zou, Will Y., et al. "Bilingual Word Embeddings for Phrase-Based Machine Translation." EMNLP, 2013.
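A small sketch of the nearest-word-by-cosine query above; the embeddings here are random placeholders, so the printed answer is arbitrary, whereas real word2vec vectors would recover "smallest":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings; in practice these come from a trained word2vec model.
vocab = ["king", "queen", "man", "woman", "big", "biggest", "small", "smallest"]
E = {w: rng.normal(size=16) for w in vocab}

def nearest(query, exclude):
    """Nearest vocabulary word to the query vector by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(E[w], query))

# v("biggest") - v("big") + v("small")  ->  ideally "smallest" with real embeddings
x = E["biggest"] - E["big"] + E["small"]
print(nearest(x, exclude={"biggest", "big", "small"}))
```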
  78. 78. Neural language models • A statistical language model gives the conditional probability of the next word given all previous ones: P(w_1^T) = Π_{t=1..T} P(w_t | w_1^{t-1}), where w_t is the t-th word and w_i^j = (w_i, w_{i+1}, ..., w_j). • n-gram model – constructs tables of conditional probabilities for the next word given combinations of the last n−1 words (the contexts): P(w_t | w_1^{t-1}) ≈ P(w_t | w_{t-n+1}^{t-1}); – only combinations of successive words that occur (frequently enough) in the training corpus are stored, and unseen combinations are handled by backing off to smaller contexts (back-off or smoothed/interpolated trigram models). • Neural language model – associates with each word a distributed word feature vector (word embedding), – expresses the joint probability function of word sequences using those vectors, and – learns simultaneously the word feature vectors and the parameters of that probability function. (figure from Bengio et al.: the neural architecture, where the i-th output is P(w_t = i | context) and f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1})); C(i) is the i-th word feature vector, stored in a shared look-up table, followed by a tanh hidden layer and a softmax output over words) Bengio, Yoshua, et al. "Neural probabilistic language models." Innovations in Machine Learning. Springer Berlin Heidelberg, 2006. 137-186.
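A toy numpy sketch of the neural language model's forward pass (concatenated feature vectors C(w), a tanh hidden layer, and a softmax over the vocabulary); sizes are made up, and the original model's direct input-to-output connections and biases are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 20, 8, 3                      # toy vocab size, feature dimension, n-gram order

C = rng.normal(scale=0.1, size=(V, m))               # word feature vectors C(i)
H = rng.normal(scale=0.1, size=((n - 1) * m, 32))    # hidden layer weights
U = rng.normal(scale=0.1, size=(32, V))              # hidden -> output weights

def next_word_probs(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}): look up the feature vectors of the
    n-1 context words, concatenate them, and pass through tanh + softmax."""
    x = np.concatenate([C[i] for i in context_ids])  # concatenated C(w_{t-1}), ...
    h = np.tanh(x @ H)
    scores = h @ U
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = next_word_probs([4, 17])            # the last n-1 = 2 words of the context
print(p.shape, p.sum())                 # (20,) ~1.0
```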
