Principle of Maximum Entropy

Jiawang Liu
liujiawang@baidu.com
2012.6
Outline

 What is Entropy
 Principle of Maximum Entropy
    Relation to Maximum Likelihood
    MaxEnt methods and Bayesian
 Applications
    NLP (POS tagging)
    Logistic regression
 Q&A
What is Entropy


  In information theory, entropy is a measure of the amount of
  information that is missing before reception; it is sometimes
  referred to as Shannon entropy. For a discrete distribution p,
  $H(p) = -\sum_x p(x) \log p(x)$.

 Uncertainty – entropy quantifies how uncertain the outcome is
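
To make the definition concrete, here is a minimal sketch in Python (my own illustration, not part of the slides):

    import math

    def shannon_entropy(p):
        # H(p) = -sum_i p_i * log2(p_i), in bits; zero-probability terms are skipped
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: maximal uncertainty for two outcomes
    print(shannon_entropy([0.9, 0.1]))   # ~0.47 bits: the outcome is fairly predictable
    print(shannon_entropy([1.0, 0.0]))   # 0.0 bits: no missing information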
Principle of Maximum Entropy


   Subject to precisely stated prior data, which must be a
   proposition that expresses testable information, the
   probability distribution which best represents the
   current state of knowledge is the one with largest
   information theoretical entropy.

Why maximum entropy?
 Minimize commitment
 Model all that is known and assume nothing about what is unknown
Principle of Maximum Entropy


Overview

 Guarantees the uniqueness and consistency of probability
  assignments obtained by different methods
 Makes explicit our freedom in using different forms of
  prior data
 Admits the most ignorance beyond the stated prior data
Principle of Maximum Entropy


Testable information
 The principle of maximum entropy is useful explicitly
  only when applied to testable information
 A piece of information is testable if it can be determined
  whether a given distribution is consistent with it.
 An example:

      The expectation of the variable x is 2.87
    and
      p2 + p3 > 0.6
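
Written out explicitly (a reconstruction, since the slide rendered these constraints as images), with $p_i$ denoting the probability that $x = x_i$:

      $\mathbb{E}[x] = \sum_i p_i\, x_i = 2.87$   and   $p_2 + p_3 > 0.6$

Both statements are testable: for any candidate distribution $(p_1, \dots, p_n)$ we can check directly whether they hold.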
Principle of Maximum Entropy


General solution
 Entropy maximization with no testable information: the entropy
  is maximized by the uniform distribution, $p(x_i) = 1/n$.

 Given testable information
      Seek the probability distribution which maximizes information
       entropy, subject to the constraints of the information.
      A constrained optimization problem, typically solved using the
       method of Lagrange multipliers.
Principle of Maximum Entropy


General solution
 Question
   Seek the probability distribution which maximizes information
    entropy, subject to some linear constraints.
 Mathematical problem
   Optimization Problem
   non-linear programming with linear constraints
 Idea
    non-linear programming with linear constraints
      → (introduce Lagrange multipliers)
    non-linear programming with no constraints
      → (set each partial derivative to 0)
    get result
Principle of Maximum Entropy


General solution
 Constraints
   Some testable information I about a quantity x taking values in
    {x1, x2,..., xn}. Express this information as m constraints on the
    expectations of the functions fk; that is, we require our
    probability distribution to satisfy
      $\sum_{i=1}^{n} p(x_i)\, f_k(x_i) = F_k, \qquad k = 1, \dots, m$
   Furthermore, the probabilities must sum to one, giving the
    constraint
      $\sum_{i=1}^{n} p(x_i) = 1$
 Objective function
      $H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$
Principle of Maximum Entropy


General solution
 The probability distribution with maximum information
  entropy subject to these constraints is
      $p(x_i) = \frac{1}{Z(\lambda_1, \dots, \lambda_m)} \exp\big(\lambda_1 f_1(x_i) + \dots + \lambda_m f_m(x_i)\big)$
 The normalization constant is determined by
      $Z(\lambda_1, \dots, \lambda_m) = \sum_{i=1}^{n} \exp\big(\lambda_1 f_1(x_i) + \dots + \lambda_m f_m(x_i)\big)$
 The λk parameters are Lagrange multipliers whose
  particular values are determined by the constraints
  according to
      $F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \dots, \lambda_m)$
      These m simultaneous equations do not generally possess a closed-form
       solution and are usually solved by numerical methods.
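
As a concrete numerical sketch (my own illustration, not from the slides), take the earlier example E[x] = 2.87 with x ranging over {1, …, 6}: there is a single constraint function f1(x) = x, hence one multiplier λ, and the moment-matching equation can be solved with a root finder:

    import numpy as np
    from scipy.optimize import brentq

    x = np.arange(1, 7)                    # outcomes {1, ..., 6}

    def model_mean(lam):
        # MaxEnt form for one constraint: p_i proportional to exp(lam * x_i)
        w = np.exp(lam * x)
        p = w / w.sum()
        return p @ x

    # Solve E_lam[x] = 2.87 for the Lagrange multiplier lam.
    lam = brentq(lambda l: model_mean(l) - 2.87, -5.0, 5.0)
    p = np.exp(lam * x) / np.exp(lam * x).sum()
    print(lam)        # negative: the target mean 2.87 is below the uniform mean 3.5
    print(p @ x)      # ~2.87, the constraint is satisfied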
Principle of Maximum Entropy


Training Model

 Generalized Iterative Scaling (GIS) (Darroch and
  Ratcliff, 1972)

 Improved Iterative Scaling (IIS) (Della Pietra et al.,
  1995)
Principle of Maximum Entropy


Training Model
 Generalized Iterative Scaling (GIS) (Darroch and
  Ratcliff, 1972)
    Compute the empirical feature expectations dj, j=1, …, k+1
     (j = k+1 is the correction feature that pads every event's
     feature total up to a constant C)
    Initialize λj (any values, e.g., 0)
    Repeat until convergence
       • For each j
              – Compute the model distribution p(n) from the current λ(n)
              – Compute the model expectation ej of feature fj under p(n)
              – Update λj(n+1) = λj(n) + (1/C) log(dj / ej)
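
A compact runnable sketch of GIS (my own illustration under simplifying assumptions, not the slides' exact notation): outcomes are enumerable, features are non-negative, and the Darroch and Ratcliff correction feature is approximated by taking C as the maximum total feature mass.

    import numpy as np

    def gis(F, d, iters=500):
        # F[y, j]: value of feature j on outcome y; d[j]: empirical expectation.
        C = F.sum(axis=1).max()          # GIS slowing constant
        lam = np.zeros(F.shape[1])       # initialize multipliers to 0
        for _ in range(iters):
            p = np.exp(F @ lam)
            p /= p.sum()                 # current model distribution
            e = p @ F                    # model expectations of each feature
            lam += np.log(d / e) / C     # multiplicative GIS update
        return lam

    # Toy check: two outcomes, one indicator feature, target expectation 0.7.
    F = np.array([[1.0], [0.0]])
    lam = gis(F, np.array([0.7]))
    p = np.exp(F @ lam); p /= p.sum()
    print(p)                             # ~[0.7, 0.3]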
Principle of Maximum Entropy


Training Model
 Generalized Iterative Scaling (GIS) (Darroch and
  Ratcliff, 1972)
   The running time of each iteration is O(NPA):
      • N: the training set size
      • P: the number of classes
      • A: the average number of features that are active for a given
        event (a, b).
Principle of Maximum Entropy


Relation to Maximum Likelihood
 Likelihood function
      $L_{\tilde{p}}(p) = \prod_{x} p(x)^{\tilde{p}(x)}$
      p(x) is the model distribution being estimated
      $\tilde{p}(x)$ is the empirical distribution of the training sample

 Log-likelihood function
      $\log L_{\tilde{p}}(p) = \sum_{x} \tilde{p}(x) \log p(x)$
Principle of Maximum Entropy


Relation to Maximum Likelihood
 Theorem
     The model p*C with maximum entropy is the model in the
      parametric family p(y|x) that maximizes the likelihood of the
      training sample.
 Coincidence?
   Entropy – the measure of uncertainty
   Likelihood – the degree of agreement with the known data
   Maximum entropy – assume nothing about what is unknown
   Maximum likelihood – reflect the known data as faithfully as possible
  Knowledge is the complement of uncertainty, so the two criteria meet.
Principle of Maximum Entropy


MaxEnt methods and Bayesian
 Bayesian methods
                       p(H|DI) = p(H|I)p(D|HI) / p(D|I)
     H stands for some hypothesis whose truth we want to judge
     D for a set of data
     I for prior information
 Difference
     A single application of Bayes’ theorem gives us only a
      probability, not a probability distribution
     MaxEnt gives us necessarily a probability distribution, not just a
      probability.
Principle of Maximum Entropy


MaxEnt methods and Bayesian
 Difference (continue)
      Bayes’ theorem cannot determine the numerical value of any
       probability directly from our information. To apply it one must first
       use some other principle to translate information into numerical
       values for p(H|I), p(D|HI), p(D|I)
      MaxEnt does not require for input the numerical values of any
       probabilities on the hypothesis space.
 In common
      The updating of a state of knowledge
      Bayes’ rule and MaxEnt are completely compatible and can be
       seen as special cases of the method of maximum relative entropy
       (Giffin and Caticha, 2007)
Applications


Maximum Entropy Model
 NLP: POS Tagging, Parsing, PP attachment, Text
  Classification, LM, …
 POS Tagging
      Features



      Model
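
The feature and model images on this slide did not survive extraction; a typical Ratnaparkhi-style form (my reconstruction, not necessarily the author's exact slide) is a binary feature such as

      $f_j(C, t) = 1$ if the current word in context $C$ is "the" and $t = \mathrm{DET}$, else $0$

with the conditional model

      $p(t \mid C) = \frac{1}{Z(C)} \exp\Big(\sum_j \lambda_j f_j(C, t)\Big), \qquad Z(C) = \sum_{t'} \exp\Big(\sum_j \lambda_j f_j(C, t')\Big)$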
Applications


Maximum Entropy Model
 POS Tagging
     Tagging with MaxEnt Model
       The conditional probability of a tag sequence t1,…, tn, given a
        sentence w1,…, wn and contexts C1,…, Cn, decomposes as
          $p(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid C_i)$
     Model Estimation

       •   The model should reflect the data
             – use the data to constrain the model
       •   What form should the constraints take?
             – constrain the expected value of each feature
Applications


Maximum Entropy Model
 POS Tagging
     The Constraints
        •   The expected value of each feature must satisfy some constraint Ki:
              $\sum_{C, t} p(C, t)\, f_i(C, t) = K_i$
        •   A natural choice for Ki is the average empirical count
              $K_i = \frac{1}{n} \sum_{j=1}^{n} f_i(C_j, t_j)$
            derived from the training data (C1, t1), (C2, t2), …, (Cn, tn)
Applications


Maximum Entropy Model
 POS Tagging
     MaxEnt Model
       •   The constraints do not uniquely identify a model
       •   The maximum entropy model is the most uniform model
             – makes no assumptions in addition to what we know from the data
       •   Set the weights to give the MaxEnt model satisfying the constraints
             – use Generalised Iterative Scaling (GIS)
     Smoothing
        •   empirical counts for low-frequency features can be unreliable
        •   a common smoothing technique is to ignore low-frequency features
        •   another is to use a prior distribution on the parameters
Applications


Maximum Entropy Model
 Logistic regression
      Classification
        •   Linear regression can be used for classification by
            thresholding the prediction θT x
        •   The problems of linear regression for classification: the
            output is not confined to [0, 1], and outliers in the
            training data can badly shift the fitted threshold
Applications


Maximum Entropy Model
 Logistic regression
      Hypothesis representation
        •   What function should represent our hypothesis in classification?
        •   We want our classifier to output values between 0 and 1
        •   When using linear regression we had hθ(x) = θT x
        •   For classification the hypothesis representation is
                                           hθ(x) = g(θT x)
              where, for any real number z,
                                          g(z) = 1/(1 + e^{-z})
                                       This is the sigmoid function, or the logistic function
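
A minimal sketch in Python (my own illustration) of the sigmoid hypothesis:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^{-z}); maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        # Hypothesis h_theta(x) = g(theta^T x)
        return sigmoid(theta @ x)

    print(sigmoid(0.0))     # 0.5, the decision boundary
    print(sigmoid(6.0))     # ~0.9975
    print(sigmoid(-6.0))    # ~0.0025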
Applications


Maximum Entropy Model
 Logistic regression
      Cost function for logistic regression
        •   Hypothesis representation
              hθ(x) = g(θT x),  with  g(z) = 1/(1 + e^{-z})
        •   Linear regression uses the following function to determine θ:
              J(θ) = 1/(2m) Σi ( hθ(x(i)) - y(i) )²
        •   Define cost( hθ(x(i)), y(i) ) = 1/2 ( hθ(x(i)) - y(i) )²
        •   Redefine J(θ) as the average cost over the training set:
              J(θ) = 1/m Σi cost( hθ(x(i)), y(i) )
        •   This J(θ) does not work for logistic regression: with the sigmoid
            hypothesis it is non-convex, so gradient descent can get stuck in
            local minima
Applications


Maximum Entropy Model
 Logistic regression
      Cost function for logistic regression
        •   A convex logistic regression cost function:
              cost( hθ(x), y ) = -log( hθ(x) )       if y = 1
              cost( hθ(x), y ) = -log( 1 - hθ(x) )   if y = 0
Applications


Maximum Entropy Model
 Logistic regression
      Simplified cost function
        •   For binary classification problems y is always 0 or 1
        •   So we can write the cost function as
                        cost( hθ(x), y ) = -y log( hθ(x) ) - (1-y) log( 1- hθ(x) )
        •   So, in summary, our cost function for the θ parameters can be defined as
                        J(θ) = -1/m Σi [ y(i) log( hθ(x(i)) ) + (1-y(i)) log( 1- hθ(x(i)) ) ]
        •   Find parameters θ which minimize J(θ)
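
The summarized cost translates directly into code (a vectorized sketch, my own illustration):

    import numpy as np

    def cost(theta, X, y):
        # J(theta) = -(1/m) sum_i [ y_i log h(x_i) + (1-y_i) log(1-h(x_i)) ]
        # X: m x n design matrix, y: labels in {0, 1}
        hx = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return -np.mean(y * np.log(hx) + (1 - y) * np.log(1 - hx))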
Applications


Maximum Entropy Model
 Logistic regression
      How to minimize the logistic regression cost function
        Use gradient descent to minimize J(θ): repeat, simultaneously for
         every j,
              θj := θj - α (1/m) Σi ( hθ(x(i)) - y(i) ) xj(i)
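
In code, the update loop looks like this (a sketch reusing the cost above; my own illustration):

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, iters=5000):
        # Batch gradient descent; the gradient of J is (1/m) X^T (h - y).
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            hx = 1.0 / (1.0 + np.exp(-(X @ theta)))
            theta -= alpha * (X.T @ (hx - y)) / m
        return theta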
Applications


Maximum Entropy Model
 Logistic regression
      Advanced optimization
        •   Good for large machine learning problems (e.g. huge feature set)
        •   What is gradient descent actually doing?
               – compute J(θ) and the derivatives
               – plug these values into gradient descent
        •   Alternatively, instead of gradient descent to minimize the cost function we
            could use
               – Conjugate gradient
               – BFGS (Broyden-Fletcher-Goldfarb-Shanno)
               – L-BFGS (Limited memory - BFGS)
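
For instance (a sketch, assuming the cost J and gradient defined earlier), SciPy exposes L-BFGS behind a uniform interface, so only the cost and its gradient need to be supplied:

    import numpy as np
    from scipy.optimize import minimize

    def J(theta, X, y):
        hx = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return -np.mean(y * np.log(hx) + (1 - y) * np.log(1 - hx))

    def grad(theta, X, y):
        hx = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return X.T @ (hx - y) / len(y)

    # Toy, non-separable data: first column is the bias term.
    X = np.array([[1, 0.5], [1, 1.5], [1, 2.5], [1, 3.5]])
    y = np.array([0.0, 1.0, 0.0, 1.0])
    res = minimize(J, np.zeros(2), args=(X, y), jac=grad, method="L-BFGS-B")
    print(res.x)                         # fitted theta, no step size to tune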
Applications


Maximum Entropy Model
 Logistic regression
      Why do we choose this function when other cost functions exist?
        •   This cost function can be derived from statistics using the principle
            of maximum likelihood estimation
               – it is the negative log-likelihood of a Bernoulli model of the
                 labels; relatedly, Gaussian class-conditional features with a
                 shared covariance yield exactly this logistic posterior
        •   Also has the nice property that it's convex
Q&A


      Thanks!
Reference

   Jaynes, E. T., 1988, 'The Relation of Bayesian and Maximum Entropy Methods',
    in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1),
    Kluwer Academic Publishers, pp. 25-26.
   https://www.coursera.org/course/ml
   Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical
    Learning, section 4.4.
   Kitamura, Y., 2006, 'Empirical Likelihood Methods in Econometrics: Theory and
    Practice', Cowles Foundation Discussion Paper 1569, Cowles Foundation, Yale
    University.
   http://en.wikipedia.org/wiki/Principle_of_maximum_entropy
   Lazar, N., 2003, 'Bayesian Empirical Likelihood', Biometrika, 90, 319-326.
   Giffin, A. and Caticha, A., 2007, 'Updating Probabilities with Data and Moments'.
   Guiasu, S. and Shenitzer, A., 1985, 'The Principle of Maximum Entropy', The
    Mathematical Intelligencer, 7(1), 42-48.
   Harremoës, P. and Topsøe, F., 2001, 'Maximum Entropy Fundamentals', Entropy,
    3(3), 191-226.
   http://en.wikipedia.org/wiki/Logistic_regression
