The document discusses the principle of maximum entropy. It explains that maximum entropy is an approach to assigning probabilities in which the chosen distribution has the largest possible entropy (uncertainty) subject to whatever information is known. It describes applications of maximum entropy modeling such as part-of-speech tagging and logistic regression. Maximum entropy and maximum likelihood are closely related: for models of exponential form, the distribution with maximum entropy subject to the data constraints is also the one that maximizes the likelihood of the training sample.
2. Outline
What is Entropy
Principle of Maximum Entropy
Relation to Maximum Likelihood
MaxEnt methods and Bayesian
Applications
NLP (POS tagging)
Logistic regression
Q&A
3. What is Entropy
In information theory, entropy is the measure of the
amount of information that is missing before reception
and is sometimes referred to as Shannon entropy.
Uncertainty
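As a concrete illustration, a minimal Python sketch of Shannon entropy, H(p) = -Σ_i p_i log2(p_i); the two distributions below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty over 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]   # outcome nearly certain, little missing information

print(entropy(uniform))  # 2.0 bits
print(entropy(peaked))
```

The uniform distribution attains the maximum, log2(4) = 2 bits; the peaked one is close to 0, matching the reading of entropy as missing information.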
4. Principle of Maximum Entropy
Subject to precisely stated prior data, which must be a
proposition that expresses testable information, the
probability distribution which best represents the
current state of knowledge is the one with largest
information theoretical entropy.
Why maximum entropy?
Minimize commitment
Model all that is known and assume nothing about what is unknown
5. Principle of Maximum Entropy
Overview
Should guarantee the uniqueness and consistency of
probability assignments obtained by different methods
Makes explicit our freedom in using different forms of
prior data
Admits the most ignorance beyond the stated prior data
6. Principle of Maximum Entropy
Testable information
The principle of maximum entropy is useful explicitly
only when applied to testable information
A piece of information is testable if it can be determined
whether a given distribution is consistent with it.
An example:
The expectation of the variable x is 2.87
and
p2 + p3 > 0.6
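Testability can be made concrete: given a candidate distribution, we can check mechanically whether it satisfies both pieces of information above. A minimal sketch (the candidate distribution is invented for illustration):

```python
def expectation(p, x):
    """E[x] under distribution p."""
    return sum(pi * xi for pi, xi in zip(p, x))

def is_consistent(p, x, tol=1e-6):
    # testable constraint 1: the expectation of x equals 2.87
    # testable constraint 2: p2 + p3 > 0.6
    return abs(expectation(p, x) - 2.87) < tol and p[1] + p[2] > 0.6

x = [1, 2, 3, 4, 5, 6]
p = [0.02, 0.35, 0.40, 0.21, 0.01, 0.01]   # invented candidate distribution
print(is_consistent(p, x))   # True
```

Because such a check is possible, both statements count as testable information; a vague statement like "x is probably small" would not be.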
7. Principle of Maximum Entropy
General solution
Entropy maximization with no testable information: the uniform distribution
Given testable information
Seek the probability distribution which maximizes information
entropy, subject to the constraints of the information.
A constrained optimization problem. It can be typically solved
using the method of Lagrange Multipliers.
8. Principle of Maximum Entropy
General solution
Question
Seek the probability distribution which maximizes information
entropy, subject to some linear constraints.
Mathematical problem
Optimization Problem
non-linear programming with linear constraints
Idea
non-linear programming with linear constraints
→ (Lagrange multipliers)
non-linear programming with no constraints
→ (set partial derivatives to 0)
get result
9. Principle of Maximum Entropy
General solution
Constraints
Some testable information I about a quantity x taking values in
{x1, x2, ..., xn}. Express this information as m constraints on the
expectations of the functions fk; that is, we require our
probability distribution to satisfy
Σ_i p(x_i) f_k(x_i) = F_k,  k = 1, ..., m
Furthermore, the probabilities must sum to one, giving the
constraint
Σ_i p(x_i) = 1
Objective function
H(p) = - Σ_i p(x_i) log p(x_i)
10. Principle of Maximum Entropy
General solution
The probability distribution with maximum information
entropy subject to these constraints is
p(x_i) = (1/Z(λ1, ..., λm)) exp( λ1 f1(x_i) + ... + λm fm(x_i) )
The normalization constant is determined by
Z(λ1, ..., λm) = Σ_i exp( λ1 f1(x_i) + ... + λm fm(x_i) )
The λk parameters are Lagrange multipliers whose
particular values are determined by the constraints
according to
F_k = ∂/∂λ_k log Z(λ1, ..., λm)
These m simultaneous equations do not generally possess a closed-form
solution and are usually solved by numerical methods.
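The general solution can be made concrete for the earlier example E[x] = 2.87 over {1, ..., 6}: with the single constraint f(x) = x, the solution is p(x_i) ∝ exp(λ x_i), and since E[x] increases monotonically with λ, the multiplier can be found by simple bisection rather than a general numerical solver. A minimal sketch:

```python
import math

x = [1, 2, 3, 4, 5, 6]
target = 2.87                       # testable constraint: E[x] = 2.87

def dist(lam):
    """Maximum entropy distribution p(x_i) = exp(lam * x_i) / Z."""
    w = [math.exp(lam * xi) for xi in x]
    Z = sum(w)                      # normalization constant
    return [wi / Z for wi in w]

def mean(p):
    return sum(pi * xi for pi, xi in zip(p, x))

# E[x] is monotone in lam, so bisection pins down the Lagrange multiplier.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(dist(mid)) < target:
        lo = mid
    else:
        hi = mid

p = dist((lo + hi) / 2)
```

With m constraints the same idea needs a multidimensional root-finder, which is why numerical methods such as iterative scaling are used in practice.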
11. Principle of Maximum Entropy
Training Model
Generalized Iterative Scaling (GIS) (Darroch and
Ratcliff, 1972)
Improved Iterative Scaling (IIS) (Della Pietra et al.,
1995)
12. Principle of Maximum Entropy
Training Model
Generalized Iterative Scaling (GIS) (Darroch and
Ratcliff, 1972)
Compute d_j, the average empirical count of feature f_j, for j = 1, ..., k+1
Initialize λ_j^(0) (any values, e.g., 0)
Repeat until convergence
• For each j
– Compute p(y|x) under the current parameters λ^(t)
– Compute e_j^(t), the model expectation of f_j under p
– Update λ_j^(t+1) = λ_j^(t) + (1/C) log( d_j / e_j^(t) )
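A minimal sketch of this GIS loop for a conditional MaxEnt model, on an invented toy dataset (the feature names, labels, and events are hypothetical; the usual slack feature is omitted because here it would be label-independent and so cannot affect p(y|x)):

```python
import math
from collections import defaultdict

# Invented toy training set: each event pairs active context features with a tag.
data = [({"f1"}, "A"), ({"f1", "f2"}, "B"), ({"f2"}, "A"), ({"f2"}, "B")]
labels = ["A", "B"]
N = len(data)
# GIS constant C: maximum number of active features per event.
C = max(len(x) for x, _ in data)

lam = defaultdict(float)           # lambda_j, initialized to 0

def p_y_given_x(x):
    """Conditional MaxEnt model p(y|x) under the current lambdas."""
    scores = {y: math.exp(sum(lam[(f, y)] for f in x)) for y in labels}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

# d_j: average empirical count of each joint feature (context feature, tag)
d = defaultdict(float)
for x, y in data:
    for f in x:
        d[(f, y)] += 1.0 / N

for _ in range(500):               # repeat until (approximately) converged
    # e_j: model expectation of each feature under the current parameters
    e = defaultdict(float)
    for x, _ in data:
        p = p_y_given_x(x)
        for f in x:
            for y in labels:
                e[(f, y)] += p[y] / N
    # GIS update: lambda_j += (1/C) * log(d_j / e_j)
    for j in d:
        lam[j] += math.log(d[j] / e[j]) / C
```

At convergence the model expectations e_j match the empirical expectations d_j, which is exactly the constraint set the MaxEnt model must satisfy.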
13. Principle of Maximum Entropy
Training Model
Generalized Iterative Scaling (GIS) (Darroch and
Ratcliff, 1972)
The running time of each iteration is O(NPA):
• N: the training set size
• P: the number of classes
• A: the average number of features that are active for a given
event (a, b).
14. Principle of Maximum Entropy
Relation to Maximum Likelihood
Likelihood function
L(p) = Π_x p(x)^p̃(x)
where p(x) is the estimated distribution and p̃(x) is the empirical distribution
Log-likelihood function
log L(p) = Σ_x p̃(x) log p(x)
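The log-likelihood in terms of the empirical distribution can be sketched as follows (the sample is invented for illustration):

```python
import math
from collections import Counter

sample = ["a", "b", "a", "a", "c", "b"]          # invented observed data
N = len(sample)
emp = {x: c / N for x, c in Counter(sample).items()}  # empirical distribution p~(x)

def log_likelihood(p):
    # (1/N) * log prod_i p(x_i)  =  sum_x p~(x) * log p(x)
    return sum(emp[x] * math.log(p[x]) for x in emp)

# The empirical distribution itself maximizes this quantity:
print(log_likelihood(emp))
print(log_likelihood({"a": 1/3, "b": 1/3, "c": 1/3}))
```

Among all candidate distributions, the empirical one scores highest, which is the sense in which maximum likelihood "fits the data".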
15. Principle of Maximum Entropy
Relation to Maximum Likelihood
Theorem
The model p* ∈ C with maximum entropy is the model in the
parametric family p(y|x) that maximizes the likelihood of the
training sample.
Coincidence?
Entropy – the measure of uncertainty
Likelihood – the degree of agreement with what is known
Maximum entropy – assume nothing about what is unknown
Maximum likelihood – faithfully fit what is known
Knowledge = the complement of uncertainty
16. Principle of Maximum Entropy
MaxEnt methods and Bayesian
Bayesian methods
p(H|DI) = p(H|I)p(D|HI) / p(D|I)
H stands for some hypothesis whose truth we want to judge
D for a set of data
I for prior information
Difference
A single application of Bayes’ theorem gives us only a
probability, not a probability distribution
MaxEnt necessarily gives us a probability distribution, not just a
single probability.
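The contrast can be seen numerically: one application of Bayes' theorem turns prior numbers into a single posterior probability. A sketch with invented numbers:

```python
# Invented numbers for illustration only.
p_H = 0.01             # p(H|I): prior probability that hypothesis H is true
p_D_given_H = 0.95     # p(D|HI): probability of the data if H is true
p_D_given_notH = 0.05  # p(D|~H I): probability of the data if H is false

# p(D|I) by the law of total probability
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)

# Bayes' theorem yields one updated number: the posterior probability of H
p_H_given_D = p_H * p_D_given_H / p_D
print(p_H_given_D)
```

Note that the three input probabilities had to come from somewhere else; Bayes' theorem only combines them, whereas MaxEnt assigns a whole distribution directly from the constraints.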
17. Principle of Maximum Entropy
MaxEnt methods and Bayesian
Difference (continue)
Bayes’ theorem cannot determine the numerical value of any
probability directly from our information. To apply it one must first
use some other principle to translate information into numerical
values for p(H|I), p(D|HI), p(D|I)
MaxEnt does not require for input the numerical values of any
probabilities on the hypothesis space.
In common
The updating of a state of knowledge
Bayes’ rule and MaxEnt are completely compatible and can be
seen as special cases of the method of maximum relative entropy
(Giffin and Caticha, 2007)
19. Applications
Maximum Entropy Model
POS Tagging
Tagging with MaxEnt Model
The conditional probability of a tag sequence t1, ..., tn,
given a sentence w1, ..., wn and contexts C1, ..., Cn, is
p(t1, ..., tn | w1, ..., wn) = Π_i p(t_i | C_i)
Model Estimation
• The model should reflect the data
– use the data to constrain the model
• What form should the constraints take?
– constrain the expected value of each feature
20. Applications
Maximum Entropy Model
POS Tagging
The Constraints
• Expected value of each feature must satisfy some constraint Ki
• A natural choice for Ki is the average empirical count
derived from the training data (C1, t1), (C2, t2), ..., (Cn, tn)
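Computing the average empirical counts Ki from training data can be sketched as follows (the contexts, feature names, and tags are invented for illustration):

```python
from collections import defaultdict

# Invented training events (context, tag); each context is a set of active features.
train = [({"prev=DT", "word=dog"}, "NN"),
         ({"prev=DT", "word=cat"}, "NN"),
         ({"prev=NN", "word=runs"}, "VBZ")]
N = len(train)

# K_i: average empirical count of each joint feature (context feature, tag)
K = defaultdict(float)
for context, tag in train:
    for f in context:
        K[(f, tag)] += 1.0 / N
```

Each K value then becomes one constraint: the model's expected count for that feature must equal its empirical average.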
21. Applications
Maximum Entropy Model
POS Tagging
MaxEnt Model
• The constraints do not uniquely identify a model
• The maximum entropy model is the most uniform model
– makes no assumptions in addition to what we know from the data
• Set the weights to give the MaxEnt model satisfying the constraints
– use Generalised Iterative Scaling (GIS)
Smoothing
• empirical counts for low frequency features can be unreliable
• Common smoothing technique is to ignore low frequency features
• Use a prior distribution on the parameters
22. Applications
Maximum Entropy Model
Logistic regression
Classification
• Linear regression for classification
• The problems of linear regression for classification
23. Applications
Maximum Entropy Model
Logistic regression
Hypothesis representation
• What function is used to represent our hypothesis in classification
• We want our classifier to output values between 0 and 1
• When using linear regression we had hθ(x) = θᵀx
• For the classification hypothesis representation we use
hθ(x) = g(θᵀx)
where, for a real number z,
g(z) = 1 / (1 + e^(-z))
This is the sigmoid function, or the logistic function
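A minimal sketch of the sigmoid hypothesis:

```python
import math

def g(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return g(sum(t * xi for t, xi in zip(theta, x)))

print(g(0))   # 0.5
```

The output can be read as p(y = 1 | x; θ), which is exactly the value-between-0-and-1 property the classifier needs.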
24. Applications
Maximum Entropy Model
Logistic regression
Cost function for logistic regression
• Hypothesis representation
• Linear regression uses the following function to determine θ:
J(θ) = (1/2m) Σ_i (hθ(x^(i)) - y^(i))^2
• Define cost(hθ(x^(i)), y^(i)) = (1/2)(hθ(x^(i)) - y^(i))^2
• Redefine J(Θ) as the average of these costs:
J(θ) = (1/m) Σ_i cost(hθ(x^(i)), y^(i))
• This J(Θ) does not work for logistic regression: with the sigmoid hypothesis it is non-convex, so gradient descent is not guaranteed to reach the global minimum
26. Applications
Maximum Entropy Model
Logistic regression
Simplified cost function
• For binary classification problems y is always 0 or 1
• So we can write the cost function as
cost(hθ(x), y) = -y log( hθ(x) ) - (1 - y) log( 1 - hθ(x) )
• So, in summary, our cost function for the θ parameters can be defined as
J(θ) = -(1/m) Σ_i [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
• Find parameters θ which minimize J(θ)
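The summed cost J(θ) can be sketched directly from its definition (a minimal implementation, not tied to any particular library):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = g(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))]."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        total += -yi * math.log(p) - (1 - yi) * math.log(1 - p)
    return total / m
```

With θ = 0, every prediction is 0.5, so the cost equals log 2 regardless of the labels, which is a handy sanity check when implementing this.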
27. Applications
Maximum Entropy Model
Logistic regression
How to minimize the logistic regression cost function
Use gradient descent to minimize J(θ); each step updates every parameter
θ_j := θ_j - α (1/m) Σ_i (hθ(x^(i)) - y^(i)) x_j^(i)
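A minimal sketch of batch gradient descent on the logistic cost, using an invented toy dataset (the features, labels, learning rate, and iteration count are all illustrative choices):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = g(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Invented 1-D dataset with a bias term: x = [1, feature], y in {0, 1}
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
alpha = 0.5                          # learning rate (illustrative choice)
for _ in range(5000):
    # gradient of J: (1/m) * sum_i (h(x_i) - y_i) * x_i
    grads = [0.0, 0.0]
    for xi, yi in zip(X, y):
        err = h(theta, xi) - yi
        for j in range(len(theta)):
            grads[j] += err * xi[j] / len(X)
    theta = [t - alpha * g for t, g in zip(theta, grads)]
```

After training, points below the learned decision boundary get probabilities near 0 and points above it probabilities near 1.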
28. Applications
Maximum Entropy Model
Logistic regression
Advanced optimization
• Good for large machine learning problems (e.g. huge feature set)
• What is gradient descent actually doing?
– compute J(θ) and the derivatives
– plug these values into gradient descent
• Alternatively, instead of gradient descent to minimize the cost function we
could use
– Conjugate gradient
– BFGS (Broyden-Fletcher-Goldfarb-Shanno)
– L-BFGS (Limited memory - BFGS)
29. Applications
Maximum Entropy Model
Logistic regression
Why do we choose this function when other cost functions exist?
• This cost function can be derived from statistics using the principle
of maximum likelihood estimation
– Note this does mean there's an underlying Gaussian assumption
relating to the distribution of features
• Also has the nice property that it's convex
31. Reference
Jaynes, E. T., 1988, "The Relation of Bayesian and Maximum Entropy Methods",
in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1),
Kluwer Academic Publishers, pp. 25-26.
Darroch, J. N. and Ratcliff, D., 1972, "Generalized Iterative Scaling for
Log-Linear Models", The Annals of Mathematical Statistics, 43(5), 1470-1480.
https://www.coursera.org/course/ml
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical
Learning, Section 4.4.
Kitamura, Y., 2006, "Empirical Likelihood Methods in Econometrics: Theory and
Practice", Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale
University.
http://en.wikipedia.org/wiki/Principle_of_maximum_entropy
Lazar, N., 2003, "Bayesian Empirical Likelihood", Biometrika, 90, 319-326.
Giffin, A. and Caticha, A., 2007, "Updating Probabilities with Data and
Moments".
Guiasu, S. and Shenitzer, A., 1985, "The Principle of Maximum Entropy", The
Mathematical Intelligencer, 7(1), 42-48.
Harremoës, P. and Topsøe, F., 2001, "Maximum Entropy Fundamentals", Entropy,
3(3), 191-226.
http://en.wikipedia.org/wiki/Logistic_regression