Machine Learning
Preliminaries and Math Refresher
M. Lüthi, T. Vetter
February 18, 2008
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
The problem of learning is arguably at the very core of the problem
of intelligence, both biological and artificial.
T. Poggio and C.R. Shelton
Model building in natural sciences
Model building
Given a phenomenon, construct a model for it.
Example (Heat Conduction)
Phenomenon: The spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature.
Model:
∂Q/∂t = −k ∮_S ∇T · dS
Learning as Model Building
Example (Learning)
Phenomenon: Learning (Inferring general rules from examples)
Model:
f* = arg max_{f ∈ H} P(D|f) P(f) / P(D)
Neural networks, Decision Trees, Naive Bayes, Support Vector Machines, etc.
Models for learning
The models for learning are the learning algorithms.
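When the hypothesis class H is finite, the maximisation above can be carried out directly. A minimal sketch, assuming a toy hypothesis class with made-up priors P(f) and likelihoods P(D|f) (all names and numbers are hypothetical); since P(D) is constant in f, it can be dropped from the arg max:

```python
# Pick f* = arg max_{f in H} P(D|f) P(f) / P(D).
# P(D) does not depend on f, so it cannot change the maximiser.
priors = {"f1": 0.5, "f2": 0.3, "f3": 0.2}          # P(f), hypothetical
likelihoods = {"f1": 0.01, "f2": 0.20, "f3": 0.05}  # P(D|f), hypothetical

def map_hypothesis(priors, likelihoods):
    """Return the hypothesis maximising the unnormalised posterior P(D|f) P(f)."""
    return max(priors, key=lambda f: likelihoods[f] * priors[f])

print(map_hypothesis(priors, likelihoods))  # f2: 0.20 * 0.3 = 0.06 beats 0.005 and 0.01
```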
Goals of the first block
Life is short . . .
We want to cover the essentials of learning.
General Setting: Mathematically precise setting of the learning problem. Valid for any kind of learning algorithm.
Statistical Learning Theory: When does learning work? Conditions any algorithm has to satisfy. Performance bounds.
Kernel Methods: Theory of kernels. Make linear algorithms non-linear. Learning from non-vectorial data.
Mathematics needed in the first block
The need for mathematics
As we treat the learning problem in a formal setting, the results
and methods are necessarily formulated in mathematical terms.
General Setting: Probability theory, statistics, basic optimization theory.
Statistical Learning Theory: More probability theory, more statistics.
Kernel Methods: Linear spaces, linear algebra, basic optimization theory.
A bit of mathematical maturity and an open mind are required. The rest will be explained.
Nothing is more practical than a good theory.
Vladimir N. Vapnik
Nothing (in computer science) is more beautiful than learning
theory?
M. Lüthi
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
Probability theory vs Statistics
Definition (Probability Theory)
A branch of mathematics concerned with the analysis of random phenomena.
General ⇒ Specific

Definition (Statistics)
The science of collecting, analyzing, presenting, and interpreting data.
Specific ⇒ General
Statistical machine learning is closely related to (inferential) statistics.
Many state-of-the-art learning algorithms are based on
concepts from probability theory.
Probabilities
Definition (Probability Space)
A probability space is a triple
(Ω, F, P)
where
Ω is the set of elementary events ω,
F is a collection of events, i.e. subsets of Ω (e.g. the power set P(Ω)),
P is a measure on F that satisfies the probability axioms.
Axioms of Probability
1 For any A ∈ F, there exists a number P(A), the probability of A, satisfying P(A) ≥ 0.
2 P(Ω) = 1.
3 Let {A_n, n ≥ 1} be a collection of pairwise disjoint events, and let A be their union. Then
P(A) = Σ_{n=1}^{∞} P(A_n).
Independence
Definition (Independence)
Two events, A and B, are independent iff the probability of their
intersection equals the product of the individual probabilities, i.e.
P(A ∩ B) = P(A) · P(B).
Definition (Conditional probability)
Given two events A and B, with P(B) > 0, we define the conditional probability of A given B, P(A|B), by the relation
P(A|B) = P(A ∩ B) / P(B).
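Both definitions can be checked by brute-force enumeration on a finite probability space. A sketch, assuming two fair dice under the uniform measure; the events A and B are arbitrary choices for illustration:

```python
from fractions import Fraction

# Sample space: all ordered pairs of two fair dice; each outcome has probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(A) for an event A given as a predicate over outcomes, uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 6          # first die shows 6
B = lambda w: w[0] + w[1] == 7   # the sum is 7
AB = lambda w: A(w) and B(w)

# Independence: P(A ∩ B) = P(A) P(B)  (holds for this particular pair)
assert prob(AB) == prob(A) * prob(B)

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = prob(AB) / prob(B)
print(p_A_given_B)  # 1/6
```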
Random Variables
A single event is not that interesting.
Definition (Random Variable)
A random variable X is a function from the probability space to a
vector of real numbers
X : Ω → Rn .
Random variables are characterized by their distribution function F :
Definition (Probability Distribution Function)
Let X : Ω → R be a random variable. We define
FX(x) = P(X ≤ x), −∞ < x < ∞.
Probability density function
Definition (Probability density function)
The density function is the function fX with the property
FX(x) = ∫_{−∞}^{x} fX(y) dy, −∞ < x < ∞.
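The defining relation can be checked numerically for a concrete density. A sketch, assuming the exponential density with an arbitrarily chosen rate λ = 1.5, whose distribution function has the closed form FX(x) = 1 − e^{−λx} for x ≥ 0:

```python
import math

lam = 1.5  # rate of an exponential distribution, arbitrary choice

def f(y):
    """Exponential density: f_X(y) = lam * exp(-lam * y) for y >= 0, else 0."""
    return lam * math.exp(-lam * y) if y >= 0 else 0.0

def F(x):
    """Closed-form distribution function: F_X(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

# F_X(x) should equal the integral of f_X from -inf (here: 0, where f vanishes) to x.
x = 2.0
assert abs(integrate(f, 0.0, x) - F(x)) < 1e-6
```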
Convergence
Definition (Convergence in Probability)
Let X1, X2, . . . be random variables. We say that Xn converges in probability to the random variable X as n → ∞ iff, for all ε > 0,
P(|Xn − X| > ε) → 0, as n → ∞.
We write Xn →p X as n → ∞.
Weak law of large numbers
Theorem (Bernoulli’s Theorem (Weak law of large numbers))
Let X1, . . . , Xn be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then
P[|(X1 + . . . + Xn)/n − µ| > ε] → 0
as n → ∞.
Thus, given enough observations xi ∼ FX, the sample mean x̄ = (1/n) Σ_{i=1}^{n} xi will approach the true mean µ.
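The theorem can be illustrated (though of course not proved) by simulation. A sketch, assuming i.i.d. fair-coin flips with true mean µ = 0.5:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n i.i.d. fair-coin flips (0/1 valued); the true mean is mu = 0.5."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The deviation |x_bar - mu| tends to shrink as n grows (a single run, not a proof).
for n in (10, 1000, 100_000):
    print(n, abs(sample_mean(n) - 0.5))
```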
Expectation
Definition (Expectation)
Let X be a random variable with probability density function fX ,
and g : R → R a function. We define the expectation
E[g(X)] := ∫_{−∞}^{∞} g(x) fX(x) dx.
Definition (Sample mean)
Let a sample x = {x1 , x2 , . . . , xn } be given. We define the
(sample) mean to be
x̄ = (1/n) Σ_{i=1}^{n} xi.
Variance
Definition (Variance)
Let X be a random variable with density function fX. The variance is given by
Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².
The square root √Var[X] of the variance is referred to as the standard deviation.
Definition (Sample Variance)
Let the sample x = {x1, x2, . . . , xn} with sample mean x̄ be given. We define the sample variance to be
s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².
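Both estimators are straightforward to compute directly. A sketch on a small made-up sample, cross-checked against Python's statistics module, which uses the same unbiased 1/(n−1) estimator:

```python
from statistics import mean, variance

# A small made-up sample; any numbers would do.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

xbar = mean(x)  # sample mean: (1/n) * sum of x_i
s2 = sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)  # note the 1/(n-1) factor

# statistics.variance implements the same unbiased estimator.
assert abs(s2 - variance(x)) < 1e-12
print(xbar, s2)  # 5.0 and 32/7 ≈ 4.571
```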
Notation
Assume F has a probability density function:
f(x) = dF(x)/dx
Formally, we write:
f(x) dx = dF(x)
Example: Expectation
E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx = ∫_{−∞}^{∞} g(x) dF(x)
Outline
1 General remarks about learning
2 Probability Theory and Statistics
3 Linear spaces
Vector Space
A set V together with two binary operations
1 vector addition + : V × V → V and
2 scalar multiplication · : R × V → V
is called a vector space over R, if it satisfies the following axioms:
1 ∀x, y ∈ V : x + y = y + x (commutativity)
2 ∀x, y, z ∈ V : x + (y + z) = (x + y) + z (associativity)
3 ∃0 ∈ V, ∀x ∈ V : 0 + x = x (identity of vector addition)
4 ∀x ∈ V : 1 · x = x (identity of scalar multiplication)
5 ∀x ∈ V, ∃(−x) ∈ V : x + (−x) = 0 (additive inverse element)
6 ∀α ∈ R, ∀x, y ∈ V : α · (x + y) = α · x + α · y (distributivity)
7 ∀α, β ∈ R, ∀x ∈ V : (α + β) · x = α · x + β · x (distributivity)
8 ∀α, β ∈ R, ∀x ∈ V : α · (β · x) = (αβ) · x (compatibility)
Vector Space
More importantly for us, the definition implies:
x + y ∈ V, ∀x, y ∈ V
αx ∈ V , ∀α ∈ R, ∀x ∈ V
Subspace criterion
Let V be a vector space over R, and let W be a subset of V .
Then W is a subspace if and only if it satisfies the following 3
conditions:
1 0∈W
2 If x, y ∈ W then x + y ∈ W
3 If x ∈ W and α ∈ R then αx ∈ W
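As a concrete illustration, the plane W = {(x, y, z) ∈ R³ : x + y + z = 0} is a subspace of R³. A spot check of the three conditions on a few vectors (a sanity check only, since the criterion quantifies over all elements of W):

```python
# W = {(x, y, z) in R^3 : x + y + z = 0} is a subspace of R^3.
# Spot-check the three subspace-criterion conditions on concrete vectors.

def in_W(v):
    """Membership test for W, up to floating-point tolerance."""
    return abs(sum(v)) < 1e-12

zero = (0.0, 0.0, 0.0)
u = (1.0, -1.0, 0.0)
v = (2.0, 3.0, -5.0)

assert in_W(zero)                                  # 1) 0 is in W
assert in_W(tuple(a + b for a, b in zip(u, v)))    # 2) closed under addition
assert in_W(tuple(2.5 * a for a in u))             # 3) closed under scalar mult.
```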
Normed spaces
Definition (Normed vector space)
A normed vector space is a pair (V, ‖·‖) where V is a vector space and ‖·‖ is the associated norm, satisfying the following properties for all u, v ∈ V and α ∈ R:
1 ‖v‖ ≥ 0 (positivity)
2 ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality)
3 ‖αv‖ = |α| ‖v‖ (positive scalability)
4 ‖v‖ = 0 ⇔ v = 0 (positive definiteness)
Definition (Inner product space)
A real inner product space is a pair (V, ⟨·, ·⟩), where V is a real vector space and ⟨·, ·⟩ the associated inner product, satisfying the following properties for all u, v, w ∈ V and α ∈ R:
1 ⟨u, v⟩ = ⟨v, u⟩ (symmetry)
2 ⟨αu, v⟩ = α⟨u, v⟩, ⟨u, αv⟩ = α⟨u, v⟩ and
⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩, ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩ (bilinearity)
3 ⟨u, u⟩ ≥ 0 (positive definiteness)
Definition (Strict inner product space)
An inner product space is called strict if
⟨u, u⟩ = 0 ⇔ u = 0.
Inner product space
The strict inner product
induces a norm: ‖f‖² = ⟨f, f⟩,
and is used to define distances and angles between elements.
Theorem (Cauchy-Schwarz inequality)
For all vectors u and v of a real inner product space (V, ⟨·, ·⟩), the following inequality holds:
|⟨u, v⟩| ≤ ‖u‖ ‖v‖.
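The inequality is easy to verify numerically for the standard inner product on Rⁿ. A sketch with arbitrarily chosen vectors, using the induced norm ‖u‖ = √⟨u, u⟩:

```python
import math

def inner(u, v):
    """Standard inner product on R^n: <u, v> = sum of u_i * v_i."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Norm induced by the inner product: ||u|| = sqrt(<u, u>)."""
    return math.sqrt(inner(u, u))

u = [1.0, 2.0, 3.0]
v = [-4.0, 0.0, 5.0]

# Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||
assert abs(inner(u, v)) <= norm(u) * norm(v)
print(abs(inner(u, v)), norm(u) * norm(v))  # 11.0 vs. sqrt(14 * 41) ≈ 23.96
```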
If you’re not comfortable with any of the presented material, you should take your favourite textbook and read up on the topic within the next two weeks.