Machine Learning
Preliminaries and Math Refresher
M. Lüthi, T. Vetter
February 18, 2008
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
The problem of learning is arguably at the very core of the problem
of intelligence, both biological and artificial.
T. Poggio and C.R. Shelton
Model building in natural sciences
Model building
Given a phenomenon, construct a model for it.
Example (Heat Conduction)
Phenomenon: The spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature.
Model:
∂Q/∂t = −k ∮_S ∇T · dS
Learning as Model Building
Example (Learning)
Phenomenon: Learning (Inferring general rules from examples)
Model:
f* = arg max_{f ∈ H} P(D|f) P(f) / P(D)
Neural networks, Decision Trees, Naive Bayes, Support Vector Machines, etc.
Models for learning
The models for learning are the learning algorithms.
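When the hypothesis class H is finite, the maximisation above can be carried out directly. A minimal sketch, assuming a toy hypothesis class with made-up priors P(f) and likelihoods P(D|f) (all names and numbers are hypothetical); since P(D) is constant in f, it can be dropped from the arg max:

```python
# Pick f* = arg max_{f in H} P(D|f) P(f) / P(D).
# P(D) does not depend on f, so it cannot change the maximiser.
priors = {"f1": 0.5, "f2": 0.3, "f3": 0.2}          # P(f), hypothetical
likelihoods = {"f1": 0.01, "f2": 0.20, "f3": 0.05}  # P(D|f), hypothetical

def map_hypothesis(priors, likelihoods):
    """Return the hypothesis maximising the unnormalised posterior P(D|f) P(f)."""
    return max(priors, key=lambda f: likelihoods[f] * priors[f])

print(map_hypothesis(priors, likelihoods))  # f2: 0.20 * 0.3 = 0.06 beats 0.005 and 0.01
```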
Goals of the first block
Life is short . . .
We want to cover the essentials of learning.
General Setting: Mathematically precise setting of the learning problem. Valid for any kind of learning algorithm.
Statistical Learning Theory: When does learning work? Conditions any algorithm has to satisfy. Performance bounds.
Kernel Methods: Theory of kernels. Make linear algorithms non-linear. Learning from non-vectorial data.
Mathematics needed in the first block
The need for mathematics
As we treat the learning problem in a formal setting, the results
and methods are necessarily formulated in mathematical terms.
General Setting: Probability theory, statistics, basic optimization theory.
Statistical Learning Theory: More probability theory, more statistics.
Kernel Methods: Linear spaces, linear algebra, basic optimization theory.
A bit of mathematical maturity and an open mind are required. The rest will be explained.
Nothing is more practical than a good theory.
Vladimir N. Vapnik
Nothing (in computer science) is more beautiful than learning
theory?
M. Lüthi
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
Probability theory vs Statistics
Definition (Probability Theory)
A branch of mathematics concerned with the analysis of random phenomena.
General ⇒ Specific

Definition (Statistics)
The science of collecting, analyzing, presenting, and interpreting data.
Specific ⇒ General
Statistical machine learning is closely related to (inferential) statistics.
Many state-of-the-art learning algorithms are based on
concepts from probability theory.
Probabilities
Definition (Probability Space)
A probability space is a triple
(Ω, F, P)
where
Ω is the set of elementary events ω,
F is a collection of events, i.e. subsets of Ω (e.g. the power set P(Ω)),
P is a measure on F that satisfies the probability axioms.
Axioms of Probability
1 For any A ∈ F, there exists a number P(A), the probability of A, satisfying P(A) ≥ 0.
2 P(Ω) = 1.
3 Let {A_n, n ≥ 1} be a collection of pairwise disjoint events, and let A be their union. Then
P(A) = Σ_{n=1}^{∞} P(A_n).
Independence
Definition (Independence)
Two events, A and B, are independent iff the probability of their
intersection equals the product of the individual probabilities, i.e.
P(A ∩ B) = P(A) · P(B).
Definition (Conditional probability)
Given two events A and B, with P(B) > 0, we define the conditional probability of A given B, P(A|B), by the relation
P(A|B) = P(A ∩ B) / P(B).
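Both definitions can be checked by brute-force enumeration on a finite probability space. A sketch, assuming two fair dice under the uniform measure; the events A and B are arbitrary choices for illustration:

```python
from fractions import Fraction

# Sample space: all ordered pairs of two fair dice; each outcome has probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(A) for an event A given as a predicate over outcomes, uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 6          # first die shows 6
B = lambda w: w[0] + w[1] == 7   # the sum is 7
AB = lambda w: A(w) and B(w)

# Independence: P(A ∩ B) = P(A) P(B)  (holds for this particular pair)
assert prob(AB) == prob(A) * prob(B)

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = prob(AB) / prob(B)
print(p_A_given_B)  # 1/6
```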
Random Variables
A single event is not that interesting.
Definition (Random Variable)
A random variable X is a function from the probability space to a
vector of real numbers
X : Ω → Rn .
Random variables are characterized by their distribution function F :
Definition (Probability Distribution Function)
Let X : Ω → R be a random variable. We define
FX(x) = P(X ≤ x), −∞ < x < ∞.
Probability density function
Definition (Probability density function)
The density function is the function fX with the property
FX(x) = ∫_{−∞}^{x} fX(y) dy, −∞ < x < ∞.
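The defining relation can be checked numerically for a concrete density. A sketch, assuming the exponential density with an arbitrarily chosen rate λ = 1.5, whose distribution function has the closed form FX(x) = 1 − e^{−λx} for x ≥ 0:

```python
import math

lam = 1.5  # rate of an exponential distribution, arbitrary choice

def f(y):
    """Exponential density: f_X(y) = lam * exp(-lam * y) for y >= 0, else 0."""
    return lam * math.exp(-lam * y) if y >= 0 else 0.0

def F(x):
    """Closed-form distribution function: F_X(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

# F_X(x) should equal the integral of f_X from -inf (here: 0, where f vanishes) to x.
x = 2.0
assert abs(integrate(f, 0.0, x) - F(x)) < 1e-6
```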
Convergence
Definition (Convergence in Probability)
Let X1, X2, . . . be random variables. We say that Xn converges in probability to the random variable X as n → ∞ iff, for all ε > 0,
P(|Xn − X| > ε) → 0, as n → ∞.
We write Xn →p X as n → ∞.
Weak law of large numbers
Theorem (Bernoulli’s Theorem (Weak law of large numbers))
Let X1, . . . , Xn be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then
P[|(X1 + . . . + Xn)/n − µ| > ε] → 0
as n → ∞.
Thus, given enough observations xi ∼ FX, the sample mean x̄ = (1/n) Σ_{i=1}^{n} xi will approach the true mean µ.
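The theorem can be illustrated (though of course not proved) by simulation. A sketch, assuming i.i.d. fair-coin flips with true mean µ = 0.5:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n i.i.d. fair-coin flips (0/1 valued); the true mean is mu = 0.5."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The deviation |x_bar - mu| tends to shrink as n grows (a single run, not a proof).
for n in (10, 1000, 100_000):
    print(n, abs(sample_mean(n) - 0.5))
```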
Expectation
Definition (Expectation)
Let X be a random variable with probability density function fX ,
and g : R → R a function. We define the expectation
E[g(X)] := ∫_{−∞}^{∞} g(x) fX(x) dx.
Definition (Sample mean)
Let a sample x = {x1 , x2 , . . . , xn } be given. We define the
(sample) mean to be
x̄ = (1/n) Σ_{i=1}^{n} xi.
Variance
Definition (Variance)
Let X be a random variable with density function fX. The variance is given by
Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².
The square root √Var[X] of the variance is referred to as the standard deviation.
Definition (Sample Variance)
Let the sample x = {x1, x2, . . . , xn} with sample mean x̄ be given. We define the sample variance to be
s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².
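Both estimators are straightforward to compute directly. A sketch on a small made-up sample, cross-checked against Python's statistics module, which uses the same unbiased 1/(n−1) estimator:

```python
from statistics import mean, variance

# A small made-up sample; any numbers would do.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

xbar = mean(x)  # sample mean: (1/n) * sum of x_i
s2 = sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)  # note the 1/(n-1) factor

# statistics.variance implements the same unbiased estimator.
assert abs(s2 - variance(x)) < 1e-12
print(xbar, s2)  # 5.0 and 32/7 ≈ 4.571
```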
Notation
Assume F has a probability density function:
f(x) = dF(x)/dx
Formally, we write:
f(x) dx = dF(x)
Example: Expectation
E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx = ∫_{−∞}^{∞} g(x) dF(x)
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
Vector Space
A set V together with two binary operations
1 vector addition + : V × V → V and
2 scalar multiplication · : R × V → V
is called a vector space over R, if it satisfies the following axioms:
1 ∀x, y ∈ V : x + y = y + x (commutativity)
2 ∀x, y, z ∈ V : x + (y + z) = (x + y) + z (associativity)
3 ∃0 ∈ V, ∀x ∈ V : 0 + x = x (identity of vector addition)
4 ∀x ∈ V : 1 · x = x (identity of scalar multiplication)
5 ∀x ∈ V, ∃(−x) ∈ V : x + (−x) = 0 (additive inverse element)
6 ∀α ∈ R, ∀x, y ∈ V : α · (x + y) = α · x + α · y (distributivity)
7 ∀α, β ∈ R, ∀x ∈ V : (α + β) · x = α · x + β · x (distributivity)
8 ∀α, β ∈ R, ∀x ∈ V : α · (β · x) = (αβ) · x (compatibility)
Vector Space
More importantly for us, the definition implies:
x + y ∈ V, ∀x, y ∈ V
αx ∈ V , ∀α ∈ R, ∀x ∈ V
Subspace criterion
Let V be a vector space over R, and let W be a subset of V .
Then W is a subspace if and only if it satisfies the following 3
conditions:
1 0∈W
2 If x, y ∈ W then x + y ∈ W
3 If x ∈ W and α ∈ R then αx ∈ W
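As a concrete illustration, the plane W = {(x, y, z) ∈ R³ : x + y + z = 0} is a subspace of R³. A spot check of the three conditions on a few vectors (a sanity check only, since the criterion quantifies over all elements of W):

```python
# W = {(x, y, z) in R^3 : x + y + z = 0} is a subspace of R^3.
# Spot-check the three subspace-criterion conditions on concrete vectors.

def in_W(v):
    """Membership test for W, up to floating-point tolerance."""
    return abs(sum(v)) < 1e-12

zero = (0.0, 0.0, 0.0)
u = (1.0, -1.0, 0.0)
v = (2.0, 3.0, -5.0)

assert in_W(zero)                                  # 1) 0 is in W
assert in_W(tuple(a + b for a, b in zip(u, v)))    # 2) closed under addition
assert in_W(tuple(2.5 * a for a in u))             # 3) closed under scalar mult.
```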
Normed spaces
Definition (Normed vector space)
A normed vector space is a pair (V, ‖·‖) where V is a vector space and ‖·‖ is the associated norm, satisfying the following properties for all u, v ∈ V and α ∈ R:
1 ‖v‖ ≥ 0 (positivity)
2 ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality)
3 ‖αv‖ = |α| ‖v‖ (positive scalability)
4 ‖v‖ = 0 ⇔ v = 0 (positive definiteness)
Definition (Inner product space)
A real inner product space is a pair (V, ⟨·, ·⟩), where V is a real vector space and ⟨·, ·⟩ the associated inner product, satisfying the following properties for all u, v, w ∈ V and α ∈ R:
1 ⟨u, v⟩ = ⟨v, u⟩ (symmetry)
2 ⟨αu, v⟩ = α⟨u, v⟩, ⟨u, αv⟩ = α⟨u, v⟩ and
⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩, ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩ (bilinearity)
3 ⟨u, u⟩ ≥ 0 (positive definiteness)
Definition (Strict inner product space)
An inner product space is called strict if
⟨u, u⟩ = 0 ⇔ u = 0.
Inner product space
The strict inner product
induces a norm: ‖f‖² = ⟨f, f⟩,
and is used to define distances and angles between elements.
Theorem (Cauchy-Schwarz inequality)
For all vectors u and v of a real inner product space (V, ⟨·, ·⟩), the following inequality holds:
|⟨u, v⟩| ≤ ‖u‖ ‖v‖.
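The inequality is easy to verify numerically for the standard inner product on Rⁿ. A sketch with arbitrarily chosen vectors, using the induced norm ‖u‖ = √⟨u, u⟩:

```python
import math

def inner(u, v):
    """Standard inner product on R^n: <u, v> = sum of u_i * v_i."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Norm induced by the inner product: ||u|| = sqrt(<u, u>)."""
    return math.sqrt(inner(u, u))

u = [1.0, 2.0, 3.0]
v = [-4.0, 0.0, 5.0]

# Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||
assert abs(inner(u, v)) <= norm(u) * norm(v)
print(abs(inner(u, v)), norm(u) * norm(v))  # 11.0 vs. sqrt(14 * 41) ≈ 23.96
```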
If you’re not comfortable with any of the presented material, you should take your favourite textbook and read up on the topic within the next two weeks.