2. What is Deep Learning?
Contents
1 What is Deep Learning?
2 History
Perceptron
Multilayer Perceptron
1st Breakthrough: Unsupervised Learning
2nd Breakthrough: Supervised Learning
3 Apply to Public Health
Epidemiology vs Machine Learning
Deep Learning vs Other ML
Hypothesis Testing vs Hypothesis Generating
4 Conclusion
3. What is Deep Learning?
Machine Learning
A field of artificial intelligence in which a computer learns so that it can make predictions.
Computer science + Statistics ??
Amazon, Google, Facebook..
4. What is Deep Learning?
Neural Network
Human brain VS Computer
3431 × 3324 = ??
The brain excels at recognition tasks such as speech recognition and character recognition.
Sequential VS Parallel
7. Figure. (A) Biological neuron; (B) artificial neuron or hidden unit; (C) biological synapse; (D) ANN synapses.
8. What is Deep Learning?
http://www.nd.com/welcome/whatisnn.htm
9. What is Deep Learning?
Deep Neural Network (DNN) ≈ Deep Learning
10. What is Deep Learning?
Korean news coverage of machine learning and deep learning:
http://www.dt.co.kr/contents.html?article_no=2014062002010960718002
http://vip.mk.co.kr/news/view/21/20/1178659.html
http://www.bloter.net/archives/196341
http://www.wikitree.co.kr/main/news_view.php?id=157174
http://weekly.chosun.com/client/news/viw.asp?nNewsNumb=002311100009&ctcd=C02
11. History
Contents
1 What is Deep Learning?
2 History
Perceptron
Multilayer Perceptron
1st Breakthrough: Unsupervised Learning
2nd Breakthrough: Supervised Learning
3 Apply to Public Health
Epidemiology vs Machine Learning
Deep Learning vs Other ML
Hypothesis Testing vs Hypothesis Generating
4 Conclusion
12. History Perceptron
Perceptron
1958, Rosenblatt [23].
y = \varphi\left( \sum_{i=1}^{n} w_i x_i + b \right) \qquad (1)
(b: bias, \varphi: activation function, e.g., logistic or tanh)
Figure. Concept of Perceptron[Honkela]
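A minimal Python sketch of Eq. (1), assuming a logistic activation; the input, weights, and bias below are illustrative values, not taken from the slides:

    import numpy as np

    def logistic(z):
        # Logistic activation: squashes any real number into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def perceptron(x, w, b, activation=logistic):
        # y = phi(sum_i w_i x_i + b), Eq. (1)
        return activation(np.dot(w, x) + b)

    x = np.array([1.0, 0.0])      # illustrative inputs
    w = np.array([0.5, -0.3])     # illustrative weights
    b = -0.1                      # illustrative bias
    print(perceptron(x, w, b))    # ~0.599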
13. History Perceptron
Low Performance
Cannot even solve the XOR problem [Hinton].
16. History Multilayer Perceptron
Gradient Descent Methods
The number of weights is too large.
Linear regression: least squares or maximum likelihood give an exact calculation.
MLP: No exact method
17. History Multilayer Perceptron
Gradient Descent Algorithm[Han-Hsing]
(a) Large Gradient (b) Small Gradient
(c) Small Learning Rate (d) Large Learning Rate
18. History Multilayer Perceptron
Example[Hinton]
A toy example to illustrate the iterative method
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and ketchup.
– You get several portions of each.
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out the price of
each portion.
• The iterative approach: Start with random guesses for the prices and
then adjust them to get a better fit to the observed prices of whole
meals.
19. History Multilayer Perceptron
Solving the equations iteratively
• Each meal price gives a linear constraint on the prices of the
portions:
price = x_{fish} w_{fish} + x_{chips} w_{chips} + x_{ketchup} w_{ketchup}
• The prices of the portions are like the weights of a linear neuron.
w = (w_{fish}, w_{chips}, w_{ketchup})
• We will start with guesses for the weights and then adjust the
guesses slightly to give a better fit to the prices given by the cashier.
20. History Multilayer Perceptron
The true weights used by the cashier
A linear neuron with inputs (portions of fish, chips, ketchup) = (2, 5, 3) and true weights (150, 50, 100): price of meal = 850 = target.
21. History Multilayer Perceptron
A model of the cashier with arbitrary initial weights
• Residual error = 350
• The “delta-rule” for learning is:
\Delta w_i = \varepsilon x_i (t - y)
• With a learning rate of 1/35,
the weight changes are
+20, +50, +30
• This gives new weights of
70, 100, 80.
– Notice that the weight for
chips got worse!
With the current guesses (50, 50, 50) and portions (2, 5, 3), the predicted price of the meal is 500.
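A small Python sketch reproducing the delta-rule update above; the portions, target, initial weights, and learning rate are the ones given on the slide:

    import numpy as np

    x = np.array([2.0, 5.0, 3.0])      # portions of fish, chips, ketchup
    w = np.array([50.0, 50.0, 50.0])   # arbitrary initial guesses for the prices
    t = 850.0                          # true price of the meal (target)
    eps = 1.0 / 35.0                   # learning rate

    y = np.dot(w, x)                   # predicted price: 500
    delta_w = eps * x * (t - y)        # delta rule: [+20, +50, +30]
    w = w + delta_w                    # new weights: [70, 100, 80]
    print(y, delta_w, w)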
22. History Multilayer Perceptron
Deriving the delta rule
• Define the error as the squared
residuals summed over all
training cases:
• Now differentiate to get error
derivatives for weights
• The batch delta rule changes
the weights in proportion to
their error derivatives summed
over all training cases
E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2
\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)
\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \varepsilon \sum_n x_i^n (t^n - y^n)
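A sketch of the batch delta rule iterated to convergence on the cashier example; the two extra meals below are hypothetical, added only so that the three prices are identifiable:

    import numpy as np

    true_w = np.array([150.0, 50.0, 100.0])   # the cashier's true prices
    # Hypothetical meals (portions of fish, chips, ketchup) and their prices
    X = np.array([[2.0, 5.0, 3.0],
                  [1.0, 2.0, 0.0],
                  [3.0, 1.0, 2.0]])
    t = X @ true_w

    w = np.array([50.0, 50.0, 50.0])          # initial guesses
    eps = 0.01                                # learning rate
    for _ in range(2000):
        y = X @ w                             # predicted meal prices
        w += eps * X.T @ (t - y)              # batch delta rule, summed over cases
    print(np.round(w, 2))                     # approaches [150, 50, 100]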
23. History Multilayer Perceptron
Backpropagation Algorithm[Kim]
(e) Forward Propagation (f) Back Propagation
24. History Multilayer Perceptron
Limitations of MLP[Kim]
1 Vanishing gradient problem
2 Typically requires lots of labeled data
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
4 Get stuck in local minima (?)
26. History Multilayer Perceptron
Vanishing Gradient[2]
Figure. Sigmoid functions
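A small numerical illustration of the vanishing gradient: the derivative of the logistic function never exceeds 0.25, so back-propagating through many sigmoid layers multiplies the gradient by small factors (the depth, weight, and pre-activation below are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)              # never exceeds 0.25

    z, w = 0.5, 1.0                       # illustrative pre-activation and weight
    grad = 1.0
    for layer in range(20):
        grad *= w * sigmoid_grad(z)       # chain rule through one sigmoid layer
    print(grad)                           # about 2.6e-13: effectively vanished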
27. History Multilayer Perceptron
Local Minima[Kim]
Figure. Global and Local Minima
28. History 1st Breakthrough: Unsupervised Learning
1st Breakthrough: Unsupervised Learning
2006: Restricted Boltzmann Machine, Deep Belief Network, Deep Boltzmann Machine [25, 13]..
Figure. Description of Unsupervised Learning[Kim]
29. History 1st Breakthrough: Unsupervised Learning
Limitations of MLP[Kim]
1 Vanishing gradient problem
Solved by bottom-up layerwise unsupervised pre-training
2 Typically requires lots of labeled data
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
Solved by using lots of unlabeled data
4 Get stuck in local minima (?)
Unsupervised pre-training may help the network initialize with good
parameters
31. History 1st Breakthrough: Unsupervised Learning
Restricted Boltzmann Machine(RBM)
The lower the energy, the higher the probability.
P(v, h) = \frac{1}{Z} \exp(-E(v, h))
(Z: normalizing constant)
Figure. Diagram of a Restricted Boltzmann Machine [Wikipedia]
32. History 1st Breakthrough: Unsupervised Learning
Energy Function
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j h_j w_{i,j} v_i = -a^T v - b^T h - h^T W v
(a_i: offset of visible variable, b_j: offset of hidden variable, w_{i,j}: weight between v_i and h_j)
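A minimal Python sketch of this energy function for binary units; the sizes and parameter values are illustrative:

    import numpy as np

    def energy(v, h, a, b, W):
        # E(v, h) = -a^T v - b^T h - h^T W v (lower energy <=> higher probability)
        return -a @ v - b @ h - h @ W @ v

    a = np.array([0.1, -0.2, 0.0])         # visible offsets a_i
    b = np.array([0.3, -0.1])              # hidden offsets b_j
    W = np.array([[0.5, -0.4, 0.2],        # weights w_{i,j}, stored hidden x visible
                  [0.1,  0.3, -0.6]])
    v = np.array([1.0, 0.0, 1.0])
    h = np.array([0.0, 1.0])

    E = energy(v, h, a, b, W)
    print(E, np.exp(-E))                   # unnormalized probability exp(-E(v, h))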
34. History 1st Breakthrough: Unsupervised Learning
Hebb's Law (Hebbian Learning Rule)
http://www.skewsme.com/behavior.htm
http://lesswrong.com/lw/71x/a_crash_course_in_the_neuroscience_of_human/l
35. History 1st Breakthrough: Unsupervised Learning
Training RBM
Find the weights that make P(v) = \sum_h P(v, h) as large as possible for the training data v.
Gradient Ascent
\log P(v) = \log\left( \frac{\sum_h \exp(-E(v,h))}{Z} \right)
= \log \sum_h \exp(-E(v,h)) - \log Z
= \log \sum_h \exp(-E(v,h)) - \log \sum_{v,h} \exp(-E(v,h))
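For a tiny RBM, log P(v) can be computed exactly by enumerating all configurations, which is what the expression above describes (the sizes and parameters below are illustrative; real RBMs are far too large for this):

    import numpy as np
    from itertools import product

    def energy(v, h, a, b, W):
        return -a @ v - b @ h - h @ W @ v

    def log_p_v(v, a, b, W):
        # log P(v) = log sum_h exp(-E(v,h)) - log sum_{v',h} exp(-E(v',h))
        n_hid, n_vis = W.shape
        hs = [np.array(h, dtype=float) for h in product([0, 1], repeat=n_hid)]
        vs = [np.array(x, dtype=float) for x in product([0, 1], repeat=n_vis)]
        num = sum(np.exp(-energy(v, h, a, b, W)) for h in hs)
        Z = sum(np.exp(-energy(x, h, a, b, W)) for x in vs for h in hs)
        return np.log(num) - np.log(Z)

    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(2, 3))      # 2 hidden, 3 visible units
    a, b = np.zeros(3), np.zeros(2)        # visible and hidden offsets
    print(log_p_v(np.array([1.0, 0.0, 1.0]), a, b, W))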
36. History 1st Breakthrough: Unsupervised Learning
\frac{\partial \log P(v)}{\partial \theta}
= -\frac{1}{\sum_h \exp(-E(v,h))} \sum_h \exp(-E(v,h)) \frac{\partial E(v,h)}{\partial \theta}
+ \frac{1}{\sum_{v,h} \exp(-E(v,h))} \sum_{v,h} \exp(-E(v,h)) \frac{\partial E(v,h)}{\partial \theta}
= -\sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(v, h) \frac{\partial E(v,h)}{\partial \theta}
37. History 1st Breakthrough: Unsupervised Learning
P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)
P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v)
p(h_j = 1 \mid v) = \sigma\left( b_j + \sum_{i=1}^{m} w_{i,j} v_i \right)
p(v_i = 1 \mid h) = \sigma\left( a_i + \sum_{j=1}^{n} w_{i,j} h_j \right)
(\sigma: activation function)
38. History 1st Breakthrough: Unsupervised Learning
\frac{\partial \log P(v)}{\partial \theta} = -\sum_h p(h \mid v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(v, h) \frac{\partial E(v,h)}{\partial \theta}
The expectation over the model distribution is intractable; it is approximated by sampling with a Gibbs sampler.
39. History 1st Breakthrough: Unsupervised Learning
Figure. Contrastive Divergence(CD-k)[7]
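A sketch of a single CD-1 update, assuming binary units and the logistic conditionals from the previous slide; the layer sizes and learning rate are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_vis, n_hid, lr = 6, 3, 0.1
    W = 0.01 * rng.normal(size=(n_hid, n_vis))        # weights w_{i,j}
    a, b = np.zeros(n_vis), np.zeros(n_hid)           # visible and hidden offsets

    v0 = rng.integers(0, 2, size=n_vis).astype(float) # one (random) training vector

    # Positive phase: p(h = 1 | v0), then sample h0
    p_h0 = sigmoid(b + W @ v0)
    h0 = (rng.random(n_hid) < p_h0).astype(float)

    # One Gibbs step (negative phase): reconstruct v, then p(h = 1 | v1)
    p_v1 = sigmoid(a + W.T @ h0)
    v1 = (rng.random(n_vis) < p_v1).astype(float)
    p_h1 = sigmoid(b + W @ v1)

    # CD-1 updates: data statistics minus reconstruction statistics
    W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)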
40. History 1st Breakthrough: Unsupervised Learning
Deep Belief Network[11, 12, 1]
1 Multiple RBM
2 Phoneme → Word → Grammar, Sentence
3 Generation is also possible!!!
http://www.cs.toronto.edu/~hinton/adi/index.htm
41. History 2nd Breakthrough: Supervised Learning
2nd Breakthrough: Supervised Learning
1 Vanishing gradient problem
Solved by a new non-linear activation: rectified linear unit (ReLU)
2 Typically requires lots of labeled data
Solved by big data crowdsourcing
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
Solved by a new regularization method: dropout, dropconnect, etc.
4 Get stuck in local minima (?)
45. Rectified Linear Unit (ReLU)
Figure. The proposed non-linearity, ReLU, and the standard neural network non-linearity, logistic [30]
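A small sketch contrasting the two non-linearities in the figure: the ReLU gradient is exactly 1 for positive inputs, while the logistic gradient is at most 0.25 and shrinks toward 0 for large inputs:

    import numpy as np

    relu = lambda z: np.maximum(0.0, z)
    relu_grad = lambda z: (z > 0).astype(float)

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    logistic_grad = lambda z: logistic(z) * (1.0 - logistic(z))

    z = np.array([-2.0, -0.5, 0.5, 2.0, 5.0])
    print(relu(z), relu_grad(z))            # gradient is 0 or 1
    print(logistic(z), logistic_grad(z))    # gradient <= 0.25, near 0 for large |z|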
48. History 2nd Breakthrough: Supervised Learning
Figure. Description of DropOut and DropConnect [Wan]
49. History 2nd Breakthrough: Supervised Learning
Figure. Using the MNIST dataset: (a) the ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increases; (b) varying the drop rate in a 400-400 network shows near-optimal performance around p = 0.5 [28]
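A minimal sketch of the two ideas in the figure, assuming a drop rate of p = 0.5 as in panel (b): dropout masks hidden activations, while DropConnect masks individual weights. The sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.5                                          # drop rate

    h = rng.normal(size=(4, 10))                     # a batch of hidden activations
    mask = rng.random(h.shape) >= p                  # keep each unit with prob. 1 - p
    h_dropout = h * mask / (1.0 - p)                 # inverted dropout: rescale at train time

    W = rng.normal(size=(10, 10))                    # a weight matrix
    W_dropconnect = W * (rng.random(W.shape) >= p)   # DropConnect: mask individual weights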
51. History 2nd Breakthrough: Supervised Learning
Local Minima Issue
High dimension and non-convex optimization
1 The solutions found at different local minima are likely to be similar.
2 Local minima ≈ global minima.
3 With many dimensions, it is difficult for a point to be a local minimum along every dimension.
52. History 2nd Breakthrough: Supervised Learning
Local minima are all similar, there are long plateaus, and it can take long to break symmetries.
Optimization is not the real problem when:
– the dataset is large
– units do not saturate too much
– a normalization layer is used
Figure. Local minima in high-dimensional, non-convex optimization (loss plotted against a parameter) [Ranzato]
5 The rectified linear unit (ReLU), DropOut, DropConnect, and similar techniques resolved the vanishing gradient and overfitting issues, making purely supervised learning possible.
6 The local minima issue turned out not to be a serious problem in high-dimensional, non-convex optimization.
61. Apply to Public Health
Contents
1 What is Deep Learning?
2 History
Perceptron
Multilayer Perceptron
1st Breakthrough: Unsupervised Learning
2nd Breakthrough: Supervised Learning
3 Apply to Public Health
Epidemiology vs Machine Learning
Deep Learning vs Other ML
Hypothesis Testing vs Hypothesis Generating
4 Conclusion
62. Apply to Public Health Epidemiology vs Machine Learning
Objective of statistics
1 Expansion of knowledge, causal inference
Statistician Pearson: ..
2 Decision making
Statistician R. A. Fisher: choosing the best-performing fertilizer
63. Apply to Public Health Epidemiology vs Machine Learning
Statistics in Epidemiology
Causal inference: what is the cause?
Models that can be interpreted are important; causal relationships are summarized.
Simple models are preferred.
The units of the variables matter (kilometer vs meter, centering issues).
72. Apply to Public Health Epidemiology vs Machine Learning
Example2: Cox proportional hazard model
Analysis of censored data.
http://www.theriac.org/DeskReference/viewDocument.php?id=188
73. Apply to Public Health Epidemiology vs Machine Learning
http://www.uni-kiel.de/psychologie/rexrepos/posts/survivalCoxPH.html
74. Apply to Public Health Epidemiology vs Machine Learning
Assumptions
\ln \lambda(t) = \ln \lambda_0(t) + \beta_1 x_1 + \cdots + \beta_p x_p
86. Apply to Public Health Epidemiology vs Machine Learning
Hazard Ratio
Easy to interpret, similar to the odds ratio.
But it involves many assumptions.
The formulas are complicated and computation is difficult.
Conditional Logistic Regression..
Even for prediction, there is no need to insist on the Cox model.
87. Apply to Public Health Epidemiology vs Machine Learning
Alternatives
Y_i: time of event
Not censored:
p(y_i \mid \mu_i, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right\}
Censored:
p(y_i \ge t_i \mid \mu_i, \sigma^2) = \int_{t_i}^{\infty} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right\} dy_i = \Phi\left( \frac{\mu_i - t_i}{\sigma} \right)
Only the CDF of the normal distribution is needed → computation is easy!!
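A sketch of this censored-normal log-likelihood using the normal CDF from SciPy; the event times, censoring indicators, and parameters below are illustrative:

    import numpy as np
    from scipy.stats import norm

    def log_likelihood(y, censored, mu, sigma):
        # Uncensored observations contribute the normal density;
        # right-censored ones contribute P(Y >= t) = Phi((mu - t) / sigma).
        ll_obs = norm.logpdf(y, loc=mu, scale=sigma)
        ll_cens = norm.logsf(y, loc=mu, scale=sigma)   # log(1 - CDF)
        return np.where(censored, ll_cens, ll_obs).sum()

    y = np.array([2.1, 3.0, 4.5, 5.0, 6.2])            # observed times
    censored = np.array([False, False, False, True, True])
    print(log_likelihood(y, censored, mu=4.0, sigma=1.5))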
88. Apply to Public Health Epidemiology vs Machine Learning
Example3: Correlation Structure
Should the correlation structure be taken into account?
1 Epidemiology: Important
Otherwise the standard errors change, and so do the p-values.
2 Prediction model: Not important
91. Apply to Public Health Epidemiology vs Machine Learning
Figure. A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases [16]
93. Apply to Public Health Epidemiology vs Machine Learning
Human VS metahuman[4]
Ted Chiang: SF writer
The overwhelming knowledge-producing capability of metahumans (artificial intelligence).
Human science: understanding what the metahumans have discovered.
Translating the metahumans' papers is what human science becomes..
94. Apply to Public Health Deep Learning vs Other ML
Deep Learning vs Other ML
Multiple Hidden Layer: High flexibility
Massive Parallel Computing
Programming language for GPU/parallel computing
CUDA (Compute Unified Device Architecture), OpenCL [21, 26]
97. Apply to Public Health Deep Learning vs Other ML
Paper[18, 5]
Figure. First pages of two papers shown on the slide:
Le et al., Building High-level Features Using Large Scale Unsupervised Learning (ICML 2012) [18]: a 9-layer locally connected sparse autoencoder with 1 billion connections, trained with model parallelism and asynchronous SGD on 10 million unlabeled 200x200 images using 1,000 machines (16,000 cores) for three days, learns a face detector without any labeled images and reaches 15.8% accuracy on 22,000 ImageNet categories, a 70% relative improvement over the previous state of the art.
Coates et al., Deep learning with COTS HPC systems (ICML 2013) [5]: a cluster of GPU servers with Infiniband interconnects and MPI trains 1-billion-parameter networks on just 3 machines in a couple of days and scales to networks with over 11 billion parameters using just 16 machines.
98. Apply to Public Health Hypothesis Testing vs Hypothesis Generating
Hypothesis Testing vs Hypothesis Generating
Figure. Hypothesis-testing and Hypothesis-generating paradigms[3]
99. Apply to Public Health Hypothesis Testing vs Hypothesis Generating
100. Apply to Public Health Hypothesis Testing vs Hypothesis Generating
101. Apply to Public Health Hypothesis Testing vs Hypothesis Generating
102. Conclusion
Contents
1 What is Deep Learning?
2 History
Perceptron
Multilayer Perceptron
1st Breakthrough: Unsupervised Learning
2nd Breakthrough: Supervised Learning
3 Apply to Public Health
Epidemiology vs Machine Learning
Deep Learning vs Other ML
Hypothesis Testing vs Hypothesis Generating
4 Conclusion
103. Conclusion
Conclusion
Deep learning is a core element of mobile health.
Mobile data: images, voice, text, etc.
A parallel computing system needs to be built.
Prediction vs Inference
Understanding the concepts of machine learning
Hypothesis Generating
Paradigm shift: causal inference → big data prediction
104. Conclusion
Reference I
[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
[2] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[3] Biesecker, L. G. (2013). Hypothesis-generating research and predictive medicine. Genome Research, 23(7):1051–1053.
[4] Chiang, T. (2000). Catching crumbs from the table. Nature, 405(6786):517.
[5] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, pages 1337–1345.
[documentation] DeepLearning.net documentation. Convolutional neural networks (LeNet). http://deeplearning.net/tutorial/lenet.html.
[7] Fischer, A. and Igel, C. (2012). An introduction to restricted Boltzmann machines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 14–36. Springer.
[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP Volume 15, pages 315–323.
[Han-Hsing] Han-Hsing, T. [ML, Python] Gradient descent algorithm (revision 2). http://hhtucode.blogspot.kr/2013/04/ml-gradient-descent-algorithm.html.
[Hinton] Hinton, G. Coursera: Neural networks for machine learning. https://class.coursera.org/neuralnets-2012-001.
[11] Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
[12] Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5):5947.
[13] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[14] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
107. Conclusion
Reference II
[Honkela] Honkela, A. Multilayer perceptrons. https://www.hiit.fi/u/ahonkela/dippa/node41.html.
[16] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
[Kim] Kim, J. (2014). Pattern recognition and machine learning summer school. http://prml.yonsei.ac.kr/.
[18] Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE.
[19] Maltarollo, V. G., Honorio, K. M., and da Silva, A. B. F. (2013). Applications of artificial neural networks in chemical problems.
[20] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
[21] Nvidia, C. (2007). Compute unified device architecture programming guide.
[Ranzato] Ranzato, M. Deep learning for vision: Tricks of the trade. www.cs.toronto.edu/~ranzato.
[23] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
[24] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, DTIC Document.
[25] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory.
[26] Stone, J. E., Gohara, D., and Shi, G. (2010). OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66.
[Wan] Wan, L. Regularization of neural networks using DropConnect. http://cs.nyu.edu/~wanli/dropc/.
[28] Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.
[Wikipedia] Wikipedia. Restricted Boltzmann machine. http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.
[30] Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q. V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., et al. (2013). On rectified linear units for speech processing. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3517–3521. IEEE.