3. Properties of Normal Distribution
Ref. https://en.wikipedia.org/wiki/Normal_distribution
- Every normal distribution is a version of the N(0, 1) whose domain has been stretched by a factor σ (the
standard deviation) and then translated by µ (the mean value).
- Any linear combination of a fixed collection of normal deviates is a normal deviate.
- Of all probability distributions over the reals with a specified mean µ and variance σ², the normal distribution N(µ, σ²) is the one with maximum entropy.
- The independence between the sample mean µ̂ and the sample standard deviation s can be used to construct the so-called t-statistic; its standard form is given below.
- Inverting the distribution of this t-statistic allows us to construct a confidence interval for µ.
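As a reminder, the standard t-statistic for an i.i.d. normal sample of size n, with sample mean µ̂ and sample standard deviation s, is
\[ t = \frac{\hat{\mu} - \mu}{s / \sqrt{n}} \;\sim\; t_{n-1}, \]
and inverting this distribution gives the familiar \((1 - \alpha)\) confidence interval \(\hat{\mu} \pm t_{n-1,\,1-\alpha/2}\, s/\sqrt{n}\).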
4. Central Limit Theorem (CLT)
Ref. https://en.wikipedia.org/wiki/Central_limit_theorem
{X_1, …, X_n}: a random sample of size n, i.e. a sequence of independent and identically distributed (i.i.d.) random variables drawn from a distribution with expected value µ and finite variance σ².
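As a reminder, the standard (Lindeberg–Lévy) form of the theorem: writing \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\) for the sample mean,
\[ \sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}\bigl(0, \sigma^2\bigr) \quad \text{as } n \to \infty, \]
i.e. the standardized sample mean converges in distribution to a standard normal, regardless of the shape of the underlying distribution (as long as its variance is finite).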
5. Central Limit Theorem (CLT)
Ref. https://en.wikipedia.org/wiki/Illustration_of_the_central_limit_theorem
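A minimal simulation in the spirit of the referenced illustration (the Uniform(0, 1) summands and the sample sizes are assumed example choices, not taken from the slides): standardized sample means of i.i.d. draws look increasingly Gaussian as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_means(n, reps=100_000):
    """Draw `reps` samples of size n from Uniform(0, 1) and return the
    standardized sample means; Uniform(0, 1) has mean 1/2 and variance 1/12."""
    x = rng.uniform(0.0, 1.0, size=(reps, n))
    means = x.mean(axis=1)
    return (means - 0.5) / np.sqrt((1.0 / 12.0) / n)

for n in (1, 2, 10, 50):
    z = standardized_means(n)
    excess_kurtosis = ((z - z.mean()) ** 4).mean() / z.var() ** 2 - 3.0
    # Uniform(0, 1) itself has excess kurtosis -1.2; the value shrinks toward 0 as n grows.
    print(n, round(float(excess_kurtosis), 3))
```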
9. Gaussian Process Motivation: Non-linear Regression
Ref. https://thegradient.pub/gaussian-process-not-quite-for-dummies/
Traditional non-linear regression typically gives you a single function that it considers the best fit to the observations.
But what about the other functions that also fit the data reasonably well?
10. 2D Gaussian as 2 Samples
Ref. https://thegradient.pub/gaussian-process-not-quite-for-dummies/
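A minimal sketch of the idea behind the referenced figure (the correlation value of 0.9 is an assumed example): a single draw from a 2D Gaussian can be read as a tiny "function" evaluated at two inputs x1 and x2, and a strong correlation between the coordinates makes the two values move together.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2x2 covariance with strong positive correlation between the two coordinates.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
draws = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5)

# Each row is one sample from the 2D Gaussian; read column 0 as f(x1) and column 1 as f(x2).
for f_x1, f_x2 in draws:
    print(f"f(x1) = {f_x1:+.2f}   f(x2) = {f_x2:+.2f}")
```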
16. Impact of Kernels on Prior Distributions
Ref. https://distill.pub/2019/visual-exploration-gaussian-processes/#Prior
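A small sketch of how the kernel choice shapes GP prior samples (the RBF and periodic kernels below are example choices of mine, mirroring the kinds of kernels explored in the referenced article):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 100)

def rbf(a, b, length=1.0):
    """Squared-exponential (RBF) kernel: smooth, slowly varying prior samples."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def periodic(a, b, period=1.0, length=1.0):
    """Periodic kernel: prior samples repeat with the given period."""
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length ** 2)

for name, kernel in [("RBF", rbf), ("periodic", periodic)]:
    K = kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
    prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(name, prior_samples.shape)           # three prior functions evaluated on the grid
```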
17. Combination of Kernels
Ref. https://distill.pub/2019/visual-exploration-gaussian-processes/#KernelCombinations
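Sums and products of valid (positive semi-definite) kernels are again valid kernels; a short sketch of the two combinations highlighted in the referenced article, using the same example kernels as above:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 100)

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def periodic(a, b, period=1.0, length=1.0):
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length ** 2)

# Addition: e.g. a smooth trend plus an exactly repeating pattern.
K_sum = rbf(x, x) + periodic(x, x)
# Multiplication: a locally periodic kernel, where the repetition decays with distance.
K_prod = rbf(x, x) * periodic(x, x)
print(K_sum.shape, K_prod.shape)
```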
18. Gaussian Process in Continuous Case
Ref. https://thegradient.pub/gaussian-process-not-quite-for-dummies/
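Handling the continuous case amounts to evaluating the GP on an arbitrarily fine grid of inputs and conditioning on the observed points; a minimal sketch of the standard posterior computation (the RBF kernel, the toy observations, and the noise level are assumptions for illustration):

```python
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# A handful of noisy observations (made-up toy data).
x_train = np.array([-2.0, -0.5, 1.0, 2.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 200)     # the "continuous" input range, finely discretized

noise = 1e-4
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

# Standard GP posterior: a full distribution over functions, not a single best fit.
alpha = np.linalg.solve(K, y_train)
mean = K_s.T @ alpha
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(mean.shape, std.shape)             # posterior mean and pointwise uncertainty on the grid
```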
19. Gaussian Processes as Single Layer Neural Networks
- If weight and bias parameters are taken to be i.i.d., the post-activations x_j^1 and x_{j'}^1 are independent for j ≠ j'.
- As z_i^1(x) is a sum of i.i.d. terms, by the CLT it will be Gaussian distributed when the network is infinitely wide.
- Therefore, any finite collection {z_i^1(x^{α=1}), …, z_i^1(x^{α=k})} has a joint multivariate Gaussian distribution, which is exactly the definition of a Gaussian process (a numerical sketch follows the reference below).
Ref. Radford M. Neal, Priors for Infinite Networks, University of Toronto, 1994
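A small numerical check of the argument above (the ReLU nonlinearity, the widths, and the specific 1/√width read-out scaling are illustrative choices; Neal's construction likewise scales the read-out weights with the inverse square root of the hidden-layer size): across random initializations of a single-hidden-layer network, the output z^1(x) for a fixed input looks increasingly Gaussian as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_samples(width, n_inits=20_000, d_in=3):
    """Sample the scalar output z^1(x) for one fixed input x across many
    independently initialized single-hidden-layer networks with i.i.d. N(0, 1)
    weights and biases, ReLU hidden units, and read-out weights scaled by
    1/sqrt(width) so the sum over hidden units keeps a finite variance."""
    x = np.ones(d_in)
    W0 = rng.normal(size=(n_inits, width, d_in))
    b0 = rng.normal(size=(n_inits, width))
    h = np.maximum(0.0, W0 @ x + b0)                 # post-activations x_j^1 (i.i.d. over j)
    W1 = rng.normal(size=(n_inits, width)) / np.sqrt(width)
    b1 = rng.normal(size=n_inits)
    return (W1 * h).sum(axis=1) + b1                 # z^1(x): a sum of i.i.d. terms plus a bias

for width in (1, 5, 100):
    z = output_samples(width)
    excess_kurtosis = ((z - z.mean()) ** 4).mean() / z.var() ** 2 - 3.0
    print(width, round(float(excess_kurtosis), 2))   # shrinks toward 0 (Gaussian) as width grows
```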
20. Gaussian Processes as Deep Neural Networks
- Constructing kernels equivalent to infinitely wide neural networks with two hidden
layers and nonlinearities
- Tamir Hazan et al., Steps Toward Deep Kernel Methods from Infinite Neural Networks, arXiv 2015
- Dropout training in neural networks as approximate Bayesian inference in deep
Gaussian processes
- Yarin Gal et al., Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep
Learning, ICML 2016
- Exact equivalence of infinitely wide deep networks and Gaussian Processes
- Jaehoon Lee et al., Deep Neural Networks as Gaussian Processes, ICLR 2018
- Convergence of infinitely wide Bayesian deep neural networks towards Gaussian processes
- Alexander G. de G. Matthews et al., Gaussian Process Behaviour in Wide Deep Neural Networks, ICLR
2018
- … and much more!
21. Next Steps
- Overparameterized networks still achieve good test accuracy
- Chiyuan Zhang et al., Understanding Deep Learning Requires Rethinking Generalization, ICLR 2017
- Empirical properties of overfitted classifiers
- Mikhail Belkin et al., To Understand Deep Learning We Need to Understand Kernel Learning, ICML
2018
- Evolution of an ANN during training can be described by a kernel
- Arthur Jacot et al., Neural Tangent Kernel: Convergence and Generalization in Neural Networks,
NeurIPS 2018
- Efficient exact algorithm for computing the extension of NTK to CNN
- Sanjeev Arora et al., On Exact Computation with an Infinitely Wide Neural Net, NeurIPS 2019