Seminar: Statistical NLP


   Machine Learning for
Natural Language Processing
              Lluís Màrquez
            TALP Research Center
      Llenguatges i Sistemes Informàtics
      Universitat Politècnica de Catalunya

              Girona, June 2003


Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP




ML4NLP
               Machine Learning
   • There are many general-purpose definitions of Machine
     Learning (or artificial learning):


         Making a computer automatically acquire some
         kind of knowledge from a concrete data domain


   • Learners are computers: we study learning algorithms
   • Resources are scarce: time, memory, data, etc.
   • It has (almost) nothing to do with: Cognitive science,
     neuroscience, theory of scientific discovery and research, etc.
   • Biological plausibility is welcome but not the main goal



ML4NLP
                Machine Learning
   • Learning... but what for?
         – To perform some particular task
         – To react to environmental inputs
         – Concept learning from data:
            • modelling concepts underlying data
            • predicting unseen observations
            • compacting the knowledge representation
            • knowledge discovery for expert systems

   • We will concentrate on:
         – Supervised inductive learning for classification
           = discriminative learning


ML4NLP
             Machine Learning

     A more precise definition:

        Obtaining a description of the concept in
      some representation language that explains
       the observations and helps predict new
          instances of the same distribution


     • What to read?
         – Machine Learning (Mitchell, 1997)


ML4NLP
                       Empirical NLP
     90’s: Application of Machine Learning (ML)
           techniques to NLP problems

     • Lexical and structural ambiguity problems
       (classification problems):
          –   Word selection (SR, MT)
          –   Part-of-speech tagging
          –   Semantic ambiguity (polysemy)
          –   Prepositional phrase attachment
          –   Reference ambiguity (anaphora)
          –   etc.

     • What to read? Foundations of Statistical Natural
       Language Processing (Manning & Schütze, 1999)
ML4NLP

     NLP “classification” problems
     • Ambiguity is a crucial problem for natural
       language understanding/processing.
       Ambiguity Resolution = Classification


         He was shot in the hand as he chased
            the robbers in the back street

                               (The Wall Street Journal Corpus)




ML4NLP

     NLP “classification” problems

      • Morpho-syntactic ambiguity:
        Part of Speech Tagging

          He was shot in the hand as he chased
             the robbers in the back street

          [in the original slide, NN/VB/JJ alternatives are
           marked under the ambiguous tokens]

                                 (The Wall Street Journal Corpus)
ML4NLP

     NLP “classification” problems

      • Semantic (lexical) ambiguity:
        Word Sense Disambiguation

          He was shot in the hand as he chased
             the robbers in the back street

          (“hand” must be disambiguated between its
           body-part and clock-part senses)

                                 (The Wall Street Journal Corpus)
ML4NLP

     NLP “classification” problems

     • Structural (syntactic) ambiguity




         He was shot in the hand as he chased
            the robbers in the back street

                               (The Wall Street Journal Corpus)




ML4NLP

     NLP “classification” problems

     • Structural (syntactic) ambiguity:
       PP-attachment disambiguation


         He was shot in the hand as he (chased
         (the robbers)NP (in the back street)PP)

                               (The Wall Street Journal Corpus)




Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP




Classification

         Feature Vector Classification
                                                                       AI
                                                                   perspective
       • An instance is a vector x = <x1, …, xn> whose components,
         called features (or attributes), are discrete- or real-valued.

       • Let X be the space of all possible instances.

       • Let Y = {y1, …, ym} be the set of categories (or classes).

       • The goal is to learn an unknown target function f : X → Y.

       • A training example is an instance x ∈ X labelled with the
         correct value of f(x), i.e., a pair <x, f(x)>.

       • Let D be the set of all training examples.
Classification

         Feature Vector Classification

     • The hypothesis space, H, is the set of functions h : X → Y
       that the learner can consider as possible definitions.

           The goal is to find a function h ∈ H such that,
           for every pair <x, f(x)> ∈ D,  h(x) = f(x).
Classification
                           An Example
                 Example | SIZE  | COLOR | SHAPE    | CLASS
                 --------+-------+-------+----------+----------
                    1    | small | red   | circle   | positive
                    2    | big   | red   | circle   | positive
                    3    | small | red   | triangle | negative
                    4    | big   | blue  | circle   | negative

              Rules:
                (COLOR=red) ∧ (SHAPE=circle) → positive
                otherwise                    → negative

              Decision Tree:
                COLOR?
                  red  → SHAPE?
                           circle   → positive
                           triangle → negative
                  blue → negative
Classification
                           An Example
                 (same training set as above)

              Rules:
                (SIZE=small) ∧ (SHAPE=circle) → positive
                (SIZE=big)   ∧ (COLOR=red)    → positive
                otherwise                     → negative

              Decision Tree:
                SIZE?
                  small → SHAPE?
                            circle   → positive
                            triangle → negative
                  big   → COLOR?
                            red  → positive
                            blue → negative
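Both rule sets above can be checked mechanically against the table; a minimal sketch in Python (the encoding is ours, not from the slides):

    EXAMPLES = [
        # ((size, color, shape), class): the four training examples above
        (("small", "red",  "circle"),   "positive"),
        (("big",   "red",  "circle"),   "positive"),
        (("small", "red",  "triangle"), "negative"),
        (("big",   "blue", "circle"),   "negative"),
    ]

    def rules_1(size, color, shape):
        """(COLOR=red) AND (SHAPE=circle) -> positive; otherwise negative."""
        return "positive" if color == "red" and shape == "circle" else "negative"

    def rules_2(size, color, shape):
        """(SIZE=small AND SHAPE=circle) or (SIZE=big AND COLOR=red) -> positive."""
        if (size == "small" and shape == "circle") or (size == "big" and color == "red"):
            return "positive"
        return "negative"

    # Both hypotheses are consistent with all of D; the choice between
    # them is exactly what the inductive bias (next slide) must resolve.
    for x, y in EXAMPLES:
        assert rules_1(*x) == y == rules_2(*x)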
Classification

         Some important concepts
     • Inductive Bias
         “Any means that a classification learning system uses to choose
         between two functions that are both consistent with the training
         data is called inductive bias” (Mooney & Cardie, 99)
          – Language bias / search bias

                                 Decision Tree:
                                   COLOR?
                                     red  → SHAPE?
                                              circle   → positive
                                              triangle → negative
                                     blue → negative
Classification

         Some important concepts
     • Inductive Bias

     • Training error and generalization error

     • Generalization ability and overfitting

      • Batch learning vs. on-line learning

     • Symbolic vs. statistical Learning

     • Propositional vs. first-order learning



Classification

                     Propositional vs.
                    Relational Learning
     • Propositional learning

                  color(red) ∧ shape(circle) → classA

     • Relational learning = ILP (induction of logic programs)

         course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)

         research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧
         link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)
Classification
            The Classification Setting
             Class, Point, Example, Data Set, ...
                                                            CoLT/SLT
                                                           perspective
          • Input Space: X ⊆ Rⁿ
          • (binary) Output Space: Y = {+1, −1}
          • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn)
          • Example: (x, y) with x ∈ X, y ∈ Y
          • Training Set: a set of m examples generated i.i.d.
            according to an unknown distribution P(x,y):
                  S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)ᵐ
Classification
            The Classification Setting
                     Learning, Error, ...
          • The hypothesis space, H, is the set of functions
            h : X → Y that the learner can consider as possible
            definitions. In SVMs they are of the form:

                  h(x) = ∑_{i=1}^{n} w_i φ_i(x) + b

          • The goal is to find a function h ∈ H such that the
            expected misclassification error on new examples,
            also drawn from P(x,y), is minimal
            (Risk Minimization, RM)
Classification
            The Classification Setting
                     Learning, Error, ...
         • Expected error (risk):

                   R(h) = ∫ loss(h(x), y) dP(x, y)

         • Problem: P itself is unknown. Only the training
           examples are known ⇒ an induction principle is needed
         • Empirical Risk Minimization (ERM): find the function
           h ∈ H for which the training error (empirical risk)
           is minimal:

                   R_emp(h) = (1/m) ∑_{i=1}^{m} loss(h(x_i), y_i)
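For concreteness, a tiny sketch of R_emp in Python; the slide leaves the loss generic, so the 0/1 loss here is our assumption:

    import numpy as np

    def empirical_risk(h, X, y):
        """R_emp(h) = (1/m) * sum_i loss(h(x_i), y_i), here with 0/1 loss."""
        return np.mean(np.array([h(x) for x in X]) != y)

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-1, -1, +1, +1])
    h = lambda x: +1 if x[0] > 1.5 else -1    # a simple threshold hypothesis
    print(empirical_risk(h, X, y))            # 0.0: h is consistent with this sample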
Classification
            The Classification Setting
                 Error, Over(under)fitting,...
         • Low training error ⇒ low true error?
         • The overfitting dilemma:

           [figure: fits of increasing complexity, from
            underfitting to overfitting]

         • Trade-off between training error and complexity
         • Different learning biases can be used
Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
  −Decision Trees
  −AdaBoost
  −Support Vector Machines

• Applications to NLP
Algorithms
                 Learning Paradigms

       • Statistical learning:
             – HMM, Bayesian Networks, ME, CRF, etc.
       • Traditional methods from Artificial Intelligence
         (ML, AI)
             – Decision trees/lists, exemplar-based learning, rule
               induction, neural networks, etc.

       • Methods from Computational Learning
         Theory (CoLT/SLT)
             – Winnow, AdaBoost, SVM’s, etc.


Algorithms
              Learning Paradigms

       • Classifier combination:
             – Bagging, Boosting, Randomization, ECOC,
               Stacking, etc.

       • Semi-supervised learning: learning from
         labelled and unlabelled examples
             – Bootstrapping, EM, Transductive learning
               (SVM’s, AdaBoost), Co-Training, etc.

       • etc.


Algorithms
                  Decision Trees
  • Decision trees are a way to represent rules underlying
    training data, with hierarchical structures that
    recursively partition the data.

  • They have been used by many research communities
    (Pattern Recognition, Statistics, ML, etc.) for data
    exploration with some of the following purposes:
    Description, Classification, and Generalization.

   • From a machine-learning perspective: Decision Trees are
     n-ary branching trees that represent classification rules
     for classifying the objects of a certain domain into a set
     of mutually exclusive classes
Algorithms
                    Decision Trees
     • Acquisition:
       Top-Down Induction of Decision Trees
       (TDIDT)
     • Systems:
             CART (Breiman et al. 84),
             ID3, C4.5, C5.0 (Quinlan 86,93,98),
             ASSISTANT, ASSISTANT-R (Cestnik et al. 87)
             (Kononenko et al. 95)
             etc.



Algorithms
                                An Example
            Decision Tree:
              SIZE?
                small → SHAPE?
                          circle   → positive
                          triangle → negative
                big   → COLOR?
                          red  → positive
                          blue → negative

            [figure: the same idea drawn as a generic n-ary tree,
             with attribute tests A1 … A5, branch values v1 … v7,
             and leaf classes C1 … C3]
Algorithms
         Learning Decision Trees
              Training:   Training Set  +  TDIDT  →  DT

              Test:       Example  +  DT  →  Class
Algorithms




         General Induction Algorithm
          function TDIDT (X: set-of-examples; A: set-of-features)
            var: tree1, tree2: decision-tree;
                 X’: set-of-examples;
                 A’: set-of-features
            end-var
            if (stopping_criterion (X)) then
               tree1 := create_leaf_tree (X)
            else
               amax  := feature_selection (X, A);
               tree1 := create_tree (X, amax);
               for-all val in values (amax) do
                 X’    := select_examples (X, amax, val);
                 A’    := A - {amax};
                 tree2 := TDIDT (X’, A’);
                 tree1 := add_branch (tree1, tree2, val)
               end-for
            end-if
            return (tree1)
          end-function
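A runnable Python rendering of the TDIDT loop above. The stopping criterion (pure node or no features left), majority-class leaves, and information-gain feature selection are our choices; the pseudocode leaves them abstract:

    from collections import Counter
    import math

    def entropy(labels):
        m = len(labels)
        return -sum(c/m * math.log2(c/m) for c in Counter(labels).values())

    def tdidt(X, y, features):
        """X: list of dicts mapping feature -> value; y: list of class labels."""
        if len(set(y)) == 1 or not features:            # stopping_criterion
            return Counter(y).most_common(1)[0][0]      # create_leaf_tree: majority class
        def info_gain(a):                               # feature_selection
            remainder = 0.0
            for v in set(x[a] for x in X):
                sub = [yi for x, yi in zip(X, y) if x[a] == v]
                remainder += len(sub) / len(y) * entropy(sub)
            return entropy(y) - remainder
        amax = max(features, key=info_gain)
        tree = {"feature": amax, "branches": {}}        # create_tree
        for val in set(x[amax] for x in X):             # for-all val in values(amax)
            Xv = [x for x in X if x[amax] == val]       # select_examples
            yv = [yi for x, yi in zip(X, y) if x[amax] == val]
            tree["branches"][val] = tdidt(Xv, yv, features - {amax})   # add_branch
        return tree

    # On the toy data from the "An Example" slides:
    X = [{"SIZE": s, "COLOR": c, "SHAPE": sh} for s, c, sh in
         [("small", "red", "circle"), ("big", "red", "circle"),
          ("small", "red", "triangle"), ("big", "blue", "circle")]]
    y = ["positive", "positive", "negative", "negative"]
    print(tdidt(X, y, {"SIZE", "COLOR", "SHAPE"}))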
Algorithms

             Feature Selection Criteria
     • Functions derived from Information Theory:
        – Information Gain, Gain Ratio (Quinlan 86)

     • Functions derived from Distance Measures
        – Gini Diversity Index (Breiman et al. 84)
        – RLM (López de Mántaras 91)

     • Statistically-based
        – Chi-square test (Sestito & Dillon 94)
        – Symmetrical Tau (Zhou & Dillon 91)

     • RELIEFF-IG: variant of RELIEFF (Kononenko 94)

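As an illustration of the distance-based criteria above, the Gini diversity index fits in a few lines (a sketch, not the CART implementation):

    from collections import Counter

    def gini(labels):
        """Gini diversity index: 1 - sum_k p_k^2 (Breiman et al. 84)."""
        m = len(labels)
        return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

    print(gini(["pos", "pos", "neg", "neg"]))   # 0.5: maximally impure for two classes
    print(gini(["pos", "pos", "pos", "pos"]))   # 0.0: pure node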
Algorithms
                Extensions of DTs
                                      (Murthy 95)


      • Pruning (pre/post)
      • Minimize the effect of the greedy approach:
        lookahead
       • Non-linear splits
      • Combination of multiple models
      • Incremental learning (on-line)
      • etc.

Algorithms
             Decision Trees and NLP
    • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
    • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez &
      Rodríguez 95,97; Màrquez et al. 00)

    • Word sense disambiguation (Brown et al. 91; Cardie 93;
      Mooney 96)

    • Parsing (Magerman 95,96; Haruno et al. 98,99)
    • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
    • Text summarization (Mani & Bloedorn 98)
    • Dialogue act tagging (Samuel et al. 98)


Algorithms
             Decision Trees and NLP
     • Noun phrase coreference
         (Aone & Bennett 95; McCarthy & Lehnert 95)

     • Discourse analysis in information extraction
        (Soderland & Lehnert 94)

     • Cue phrase identification in text and speech
        (Litman 94; Siegel & McKeown 94)

     • Verb classification in Machine Translation
        (Tanaka 96; Siegel 97)




Algorithms

        Decision Trees: pros&cons

    • Advantages
        – Acquires symbolic knowledge in an
          understandable way
       – Very well studied ML algorithms and variants
       – Can be easily translated into rules
       – Existence of available software: C4.5, C5.0, etc.
       – Can be easily integrated into an ensemble




Algorithms


        Decision Trees: pros&cons
  • Drawbacks
     – Computationally expensive when scaling to large
       natural language domains: training examples,
       features, etc.
      – Data sparseness and data fragmentation: the problem
        of the small disjuncts ⇒ probability estimation
      – DTs are a model with high variance (unstable)
      – Tendency to overfit the training data: pruning is necessary
      – Requires quite a big effort in tuning the model
Algorithms
             Boosting algorithms
   • Idea
     “to combine many simple and moderately accurate
      hypotheses (weak classifiers) into a single and highly
      accurate classifier”

   • AdaBoost (Freund & Schapire 95) has been
     theoretically and empirically studied extensively

    • Many other variants and extensions (1997-2003)
      http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html




Algorithms
             AdaBoost: general scheme

              [figure: T weak learners h1 … hT are trained in
               sequence, each on the training set TS1 … TST weighted
               by a probability distribution D1 … DT; the distribution
               is updated after every round, and the final classifier
               is a linear combination F(h1, h2, …, hT)]
Algorithms
             AdaBoost: algorithm




                            (Freund & Schapire 97)

              [algorithm figure not reproduced here; a sketch follows]
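A compact sketch of binary AdaBoost as described in (Freund & Schapire 97); weak_learn stands in for any weak learner, e.g. the axis-parallel hyperplanes of the next slide:

    import numpy as np

    def adaboost(X, y, T, weak_learn):
        """y: numpy array in {-1,+1}; weak_learn(X, y, D) returns h(x) -> {-1,+1}."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D1: uniform distribution
        hs, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                  # train on current distribution
            pred = np.array([h(x) for x in X])
            eps = float(D[pred != y].sum())          # weighted training error
            if eps == 0.0 or eps >= 0.5:             # perfect, or no better than chance
                if eps == 0.0:
                    hs, alphas = [h], [1.0]
                break
            alpha = 0.5 * np.log((1.0 - eps) / eps)  # confidence of this round
            D *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
            D /= D.sum()                             # renormalize to a distribution
            hs.append(h)
            alphas.append(alpha)
        # Final hypothesis: sign of the linear combination F(h1, ..., hT)
        return lambda x: int(np.sign(sum(a * h(x) for a, h in zip(alphas, hs))))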
Algorithms
             AdaBoost: example




    Weak hypotheses = vertical/horizontal hyperplanes

Algorithms
              AdaBoost: rounds 1–3

              [figures: the reweighted examples and the chosen weak
               hypothesis after each of the first three rounds]
Algorithms
             Combined Hypothesis




        www.research.att.com/~yoav/adaboost




Algorithms
                AdaBoost and NLP
      • POS Tagging (Abney et al. 99; Màrquez 99)
      • Text and Speech Categorization
         (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)

      • PP-attachment Disambiguation (Abney et al. 99)
      • Parsing (Haruno et al. 99)
      • Word Sense Disambiguation (Escudero et al. 00, 01)
      • Shallow parsing (Carreras & Màrquez, 01a; 02)
      • Email spam filtering (Carreras & Màrquez, 01b)
      • Term Extraction (Vivaldi, et al. 01)

Algorithms

             AdaBoost: pros&cons
       + Easy to implement and few parameters to set
       + Time and space grow linearly with number of
         examples. Ability to manage very large learning
         problems
       + Does not constrain explicitly the complexity of the
         learner
       + Naturally combines feature selection with learning
        + Has been successfully applied to many practical
          problems
Algorithms

             AdaBoost: pros&cons

       ± Seems to be rather robust to overfitting
         (number of rounds) but sensitive to noise

       ± Performance is very good when there are
         relatively few relevant terms (features)

        – Can perform poorly when there is insufficient
          training data relative to the complexity of the
          base classifiers: the training errors of the base
          classifiers become too large too quickly
Algorithms



             SVM: A General Definition
      • “Support Vector Machines (SVM) are learning
        systems that use a hypothesis space of linear
        functions in a high dimensional feature space,
        trained with a learning algorithm from optimisation
        theory that implements a learning bias derived
        from statistical learning theory”.
        (Cristianini & Shawe-Taylor, 2000)




Algorithms
                Linear Classifiers
     • Hyperplanes in Rᴺ
     • Defined by a weight vector (w) and a threshold (b)
     • They induce the classification rule:

           h(x) = sign( ∑_{i=1}^{N} w_i x_i + b )
                  i.e.  +1 if ∑_{i=1}^{N} w_i x_i + b ≥ 0,  −1 otherwise

       [figure: positive and negative points separated by a
        hyperplane with normal vector w and offset b]
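The rule in a few lines of numpy (the values of w and b are illustrative):

    import numpy as np

    def h(x, w, b):
        """h(x) = sign(w . x + b), with the common convention sign(0) = +1."""
        return +1 if float(np.dot(w, x)) + b >= 0 else -1

    w, b = np.array([1.0, -2.0]), 0.5
    print(h(np.array([3.0, 1.0]), w, b))   # w.x + b = 1.5  ->  +1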
Algorithms
              Optimal Hyperplane:
              Geometric Intuition
              [figure: a maximal margin hyperplane separating two
               classes; the closest examples, which determine the
               margin, are the support vectors]
Algorithms
             Linearly separable data




        geometric margin = 2 / ‖w‖

        maximizing the margin is equivalent to minimizing ‖w‖²
        (a Quadratic Programming problem) subject to the constraints:

              y_i (w · x_i + b) ≥ 1    for all i = 1, …, l
Algorithms
       Non-separable case (soft margin)




              ξ_1, …, ξ_l ≥ 0 : positive slack variables for
              introducing costs

              Minimize  ‖w‖² + C ∑_{i=1}^{l} ξ_i   subject to the
              constraints:

                    y_i (w · x_i + b) ≥ 1 − ξ_i    for all i = 1, …, l
                    ξ_i ≥ 0                        for all i = 1, …, l
Algorithms
                      Non-linear SVMs
     • Implicit mapping into feature space via kernel functions

           φ : X → F                                   non-linear mapping

           f(x) = ∑_{i=1}^{n} w_i φ_i(x) + b           set of hypotheses

           f(x) = ∑_{i=1}^{l} α_i y_i ⟨φ(x_i), φ(x)⟩ + b   dual formulation

           K(x, z) = ⟨φ(x), φ(z)⟩                      kernel function

           f(x) = ∑_{i=1}^{l} α_i y_i K(x_i, x) + b    evaluation
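The dual evaluation only ever touches the kernel, never φ itself; a sketch with a degree-3 polynomial kernel (the support vectors, α's and b would come from a trained model):

    import numpy as np

    def poly_kernel(x, z, degree=3):
        """K(x, z) = (x . z + 1)^degree: an implicit polynomial feature map."""
        return (np.dot(x, z) + 1.0) ** degree

    def svm_decision(sv, alpha, y_sv, b, x, K=poly_kernel):
        """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, over support vectors only."""
        return np.sign(sum(a * yi * K(xi, x) for xi, a, yi in zip(sv, alpha, y_sv)) + b)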
Algorithms
                   Non-linear SVMs
    • Kernel functions
       – Must be efficiently computable

       – Characterization via Mercer’s theorem

       – One of the curious facts about using a kernel is
             that we do not need to know the underlying
             feature map in order to be able to learn in the
             feature space! (Cristianini & Shawe-Taylor, 2000)
       – Examples: polynomials, Gaussian radial basis
         functions, two-layer sigmoidal neural networks,
         etc.

Algorithms

                Non linear SVMs
                   Degree 3 polynomial kernel

              [figures: the resulting decision boundaries on a
               linearly separable and a linearly non-separable
               data set]
Algorithms

                      Toy Examples
       • All examples have been run with the 2D graphical
         interface of LIBSVM (Chang and Lin, National Taiwan
         University)
          “LIBSVM is an integrated software for support vector classification
          (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution
          estimation (one-class SVM). It supports multi-class classification. The
          basic algorithm is a simplification of both SMO by Platt and SVMLight
          by Joachims. It is also a simplification of the modification 2 of SMO by
          Keerthi et al. Our goal is to help users from other fields to easily use
          SVM as a tool. LIBSVM provides a simple interface where users can
          easily link it with their own programs…”

       • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm
         (it includes a Web-integrated demo tool)
Algorithms
             Toy Examples (I)


                                     Linearly separable data set
                                     Linear SVM
                                     Maximal margin hyperplane

                       What happens if we add a blue training
                       example here (at the point marked in
                       the figure)?
Algorithms
             Toy Examples (I)


                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    High value of C parameter
                                    Maximal margin Hyperplane




                     The example is
                    correctly classified


Algorithms
             Toy Examples (I)


                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    Low value of C parameter
                                     Trade-off between margin
                                     and training error


                     The example is
                    now a bounded SV


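The effect of C can be reproduced with any LIBSVM front end; a sketch with scikit-learn's SVC, which wraps LIBSVM (the data points are made up to mimic the figures):

    import numpy as np
    from sklearn.svm import SVC

    # Two separable clusters, plus one blue (-1) point close to the other class.
    X = np.array([[0, 0], [0, 1], [1, 0], [2.6, 2.6],
                  [3, 3], [3, 4], [4, 3]], dtype=float)
    y = np.array([-1, -1, -1, -1, +1, +1, +1])

    for C in (1000.0, 0.05):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # High C: narrow margin, the awkward point is classified correctly.
        # Low C: wider margin; the point may be traded off as a bounded SV.
        print(C, clf.predict([[2.6, 2.6]]), "support vectors:", len(clf.support_))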
Algorithms
              Toy Examples (II)

              [figure-only slides; not reproduced here]
Algorithms
              Toy Examples (III)

              [figure-only slide; not reproduced here]
Algorithms

                SVM: Summary
      • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik,
        1992). Great development since then

     • Kernel-induced feature spaces: SVMs work efficiently
       in very high dimensional feature spaces (+)

     • Learning bias: maximal margin optimisation.
       Reduces the danger of overfitting. Generalization
       bounds for SVMs (+)

     • Compact representation of the induced hypothesis.
       The solution is sparse in terms of SVs (+)


Algorithms

                 SVM: Summary
      • Due to Mercer’s conditions on the kernels, the
        optimisation problems are convex. No local minima (+)
     • Optimisation theory guides the implementation.
       Efficient learning (+)
     • Mainly for classification but also for regression,
       density estimation, clustering, etc.
     • Success in many real-world applications: OCR, vision,
       bioinformatics, speech recognition, NLP: TextCat, POS
       tagging, chunking, parsing, etc. (+)
     • Parameter tuning (–). Implications in convergence
       times, sparsity of the solution, etc.

Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP




Applications

                NLP problems

        • Warning! We will not focus on
          final NLP applications, but on
          intermediate tasks...

        • We will classify the NLP tasks
          according to their (structural)
          complexity



Applications
          NLP problems: structural
                complexity
       • Decisional problems
          − Text Categorization, Document filtering, Word
            Sense Disambiguation, etc.
       • Sequence tagging and detection of
         sequential structures
          − POS tagging, Named Entity extraction,
            syntactic chunking, etc.
       • Hierarchical structures
          − Clause detection, full parsing, IE of complex
            concepts, composite Named Entities, etc.
Applications

                  POS tagging
       • Morpho-syntactic ambiguity:
         Part of Speech Tagging


          He was shot in the hand as he chased
             the robbers in the back street

          [in the original slide, NN/VB/JJ alternatives are
           marked under the ambiguous tokens]

                                  (The Wall Street Journal Corpus)
Applications

                           POS tagging
                             “preposition-adverb” tree

        [decision tree figure: the root (P(IN)=0.81, P(RB)=0.19)
         tests the Word Form; the branch for “As”/“as”
         (P(IN)=0.83, P(RB)=0.17) tests tag(+1); the branch for
         tag(+1)=RB (P(IN)=0.13, P(RB)=0.87) tests tag(+2); the
         leaf for tag(+2)=IN has P(IN)=0.013, P(RB)=0.987]

 Probabilistic interpretation:

 P̂( RB | word=“As/as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.987
 P̂( IN | word=“As/as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.013
Applications

                      POS tagging
                       “preposition-adverb” tree

        [same decision tree figure as above]

        Collocations:
          “as_RB much_RB as_IN”
          “as_RB soon_RB as_IN”
          “as_RB well_RB as_IN”
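The branch of the tree shown above, hand-encoded and queried in Python (only the displayed path; the "others" branches are omitted, and the encoding is ours):

    # Leaf nodes are class-probability tables; internal nodes are (feature, branches).
    TREE = ("word form", {
        "as": ("tag(+1)", {
            "RB": ("tag(+2)", {
                "IN": {"IN": 0.013, "RB": 0.987},
            }),
        }),
    })

    def classify(node, feats):
        if isinstance(node, dict):          # reached a leaf
            return node
        feature, branches = node
        return classify(branches[feats[feature]], feats)

    print(classify(TREE, {"word form": "as", "tag(+1)": "RB", "tag(+2)": "IN"}))
    # {'IN': 0.013, 'RB': 0.987} -> tag this "as" as RB, as in "as_RB much_RB as_IN"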
Applications
                         POS tagging
                                        RTT (Màrquez & Rodríguez 97)

        [pipeline figure: raw text → morphological analysis →
         disambiguation loop (classify → update → filter, repeated
         until a stop condition holds, guided by a language model)
         → tagged text]

        A Sequential Model for Multi-class Classification:
        NLP/POS Tagging (Even-Zohar & Roth, 01)
Applications
                               POS tagging
                                         STT (Màrquez & Rodríguez 97)

        [pipeline figure: raw text → morphological analysis →
         Viterbi algorithm, using a language model made of lexical
         probs. + contextual probs. → tagged text; the Viterbi step
         performs the disambiguation]

        The Use of Classifiers in Sequential Inference:
        Chunking (Punyakanok & Roth, 00)
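A minimal Viterbi sketch for the disambiguation step (illustrative: in STT the lexical and contextual probabilities come from the acquired decision trees):

    import numpy as np

    def viterbi(words, tags, lex, ctx):
        """lex[t][w] ~ P(w|t), ctx[t_prev][t] ~ P(t|t_prev); returns the best tag path."""
        n, T = len(words), len(tags)
        delta = np.zeros((n, T))                 # best path score ending in tag j
        back = np.zeros((n, T), dtype=int)       # backpointers
        for j, t in enumerate(tags):
            delta[0, j] = lex[t].get(words[0], 1e-9)
        for i in range(1, n):
            for j, t in enumerate(tags):
                scores = [delta[i - 1, k] * ctx[tp][t] for k, tp in enumerate(tags)]
                back[i, j] = int(np.argmax(scores))
                delta[i, j] = max(scores) * lex[t].get(words[i], 1e-9)
        path = [int(np.argmax(delta[n - 1]))]
        for i in range(n - 1, 0, -1):
            path.append(int(back[i, path[-1]]))
        return [tags[j] for j in reversed(path)]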
Applications


       Detection of sequential and
         hierarchical structures


           • Named Entity recognition
           • Clause detection




Conclusions

       Summary/conclusions
      • We have briefly outlined:
         −The ML setting: “supervised learning for
          classification”
         −Three concrete machine learning
          algorithms
          −How to apply them to solve intermediate
           NLP tasks
Conclusions

       Summary/conclusions
      • Any ML algorithm for NLP should be:
         – Robust to noise and outliers
         – Efficient in large feature/example spaces
         – Adaptive to new/changing domains:
           portability, tuning, etc.
         – Able to take advantage of unlabelled
           examples: semi-supervised learning


Conclusions

       Summary/conclusions
      • Statistical and ML-based Natural
        Language Processing is a very active
        and multidisciplinary area of research




Conclusions

      Some current research lines
     • Appropriate learning paradigm for all kind of
       NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME
        (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02).

     • Definition of an adequate (and task-specific)
       feature space: mapping from the input space to a
        high dimensional feature space, kernels, etc.

     • Resolution of complex NLP problems:
        inference with classifiers + constraint satisfaction

     • etc.

Conclusions
                 Bibliography
      • You may find additional information at:
        http://www.lsi.upc.es/~lluism/
          tesi.html
          publicacions/pubs.html
          cursos/talks.html
          cursos/MLandNL.html
          cursos/emnlp1.html


     • This talk at:
       http://www.lsi.upc.es/~lluism/udg03.ppt.gz



More Related Content

Viewers also liked

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Why Now Is The Time For NLP
Why Now Is The Time For NLPWhy Now Is The Time For NLP
Why Now Is The Time For NLPLinda Ferguson
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
neuro-linguistic programming
neuro-linguistic programmingneuro-linguistic programming
neuro-linguistic programmingMichael Buckley
 
Neuro Linguistic Programming
Neuro Linguistic ProgrammingNeuro Linguistic Programming
Neuro Linguistic Programmingsmjk
 
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...Health Catalyst
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA DATASCIENCE
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...Maarten van Wesel
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...Ahmed Mater
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attributionReza Ramezani
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaTraian Rebedea
 

Viewers also liked (20)

NLP for business analysts
NLP for business analystsNLP for business analysts
NLP for business analysts
 
NLP_lectures_English
NLP_lectures_EnglishNLP_lectures_English
NLP_lectures_English
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Deeplearning NLP
Deeplearning NLPDeeplearning NLP
Deeplearning NLP
 
Why Now Is The Time For NLP
Why Now Is The Time For NLPWhy Now Is The Time For NLP
Why Now Is The Time For NLP
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Neuro linguistic programming(nlp)
Neuro linguistic programming(nlp)Neuro linguistic programming(nlp)
Neuro linguistic programming(nlp)
 
neuro-linguistic programming
neuro-linguistic programmingneuro-linguistic programming
neuro-linguistic programming
 
NLP for project managers
NLP for project managersNLP for project managers
NLP for project managers
 
Neuro Linguistic Programming
Neuro Linguistic ProgrammingNeuro Linguistic Programming
Neuro Linguistic Programming
 
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Develope yourself nlp
Develope yourself nlpDevelope yourself nlp
Develope yourself nlp
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attribution
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Machine Learning for NLP

  • 1. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003
  • 2. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 3. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 4. ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): Making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal Machine Learning for NLP 30/06/2003
  • 5. ML4NLP Machine Learning • Learning... but what for? – To perform some particular task – To react to environmental inputs – Concept learning from data: • modelling concepts underlying data • predicting unseen observations • compacting the knowledge representation • knowledge discovery for expert systems • We will concentrate on: – Supervised inductive learning for classification = discriminative learning Machine Learning for NLP 30/06/2003
  • 6. ML4NLP Machine Learning A more precise definition: Obtaining a description of the concept in some representation language that explains observations and helps predicting new instances of the same distribution • What to read? – Machine Learning (Mitchell, 1997) Machine Learning for NLP 30/06/2003
  • 7. ML4NLP Empirical NLP 90’s: Application of Machine Learning (ML) techniques to NLP problems • Lexical and structural ambiguity problems (all classification problems): – Word selection (SR, MT) – Part-of-speech tagging – Semantic ambiguity (polysemy) – Prepositional phrase attachment – Reference ambiguity (anaphora) – etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999) Machine Learning for NLP 30/06/2003
  • 8. ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 9. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 10. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 11. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity He was shot in the hand as he chased the robbers in the back street (the slide shows candidate senses for “hand”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 12. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation He was shot in the hand as he chased the robbers in the back street (the slide shows candidate senses for “hand”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 13. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 14. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 15. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 16. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP Machine Learning for NLP 30/06/2003
  • 17. Classification Feature Vector Classification (AI perspective) • An instance is a vector: x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued. • Let X be the space of all possible instances. • Let Y = {y1, …, ym} be the set of categories (or classes). • The goal is to learn an unknown target function, f : X → Y • A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair <x, f(x)> • Let D be the set of all training examples. Machine Learning for NLP 30/06/2003
  • 18. Classification Feature Vector Classification • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x) Machine Learning for NLP 30/06/2003
  • 19. Classification An Example. Training set:
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
    Rules: (COLOR=red) ∧ (SHAPE=circle) → positive; otherwise → negative
    Decision Tree: COLOR = red → test SHAPE (circle → positive, triangle → negative); COLOR = blue → negative
    Machine Learning for NLP 30/06/2003
  • 20. Classification An Example. Same training set:
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
    Rules: (SIZE=small) ∧ (SHAPE=circle) → positive; (SIZE=big) ∧ (COLOR=red) → positive; otherwise → negative
    Decision Tree: SIZE = small → test SHAPE (circle → positive, triangle → negative); SIZE = big → test COLOR (red → positive, blue → negative)
    Machine Learning for NLP 30/06/2003
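    To make slides 19-20 concrete, here is a minimal Python sketch (not part of the original slides; all names are illustrative) encoding the toy training set and the two hypotheses, both of which are consistent with the data:

        # Toy data set of slides 19-20: (size, color, shape, class)
        DATA = [
            ("small", "red",  "circle",   "positive"),
            ("big",   "red",  "circle",   "positive"),
            ("small", "red",  "triangle", "negative"),
            ("big",   "blue", "circle",   "negative"),
        ]

        def h1(size, color, shape):        # slide 19: COLOR/SHAPE tree
            return "positive" if color == "red" and shape == "circle" else "negative"

        def h2(size, color, shape):        # slide 20: SIZE-based tree
            if size == "small":
                return "positive" if shape == "circle" else "negative"
            return "positive" if color == "red" else "negative"

        # Both hypotheses fit the training data perfectly; only the
        # inductive bias (next slide) decides which one a learner prefers.
        assert all(h(s, c, sh) == y for (s, c, sh, y) in DATA for h in (h1, h2))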
  • 21. Classification Some important concepts • Inductive Bias “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) – Language / Search bias (the COLOR/SHAPE decision tree from slide 19 is shown as an example) Machine Learning for NLP 30/06/2003
  • 22. Classification Some important concepts • Inductive Bias • Training error and generalization error • Generalization ability and overfitting • Batch learning vs. on-line learning • Symbolic vs. statistical learning • Propositional vs. first-order learning Machine Learning for NLP 30/06/2003
  • 23. Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) → classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z) Machine Learning for NLP 30/06/2003
  • 24. Classification The Classification Setting Class, Point, Example, Data Set, ... (CoLT/SLT perspective) • Input Space: X ⊆ Rn • (binary) Output Space: Y = {+1, −1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ⊆ (X × Y)^m Machine Learning for NLP 30/06/2003
  • 25. Classification The Classification Setting Learning, Error, ... • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form: h(x) = Σ_{i=1..n} w_i φ_i(x) + b • The goal is to find a function h ∈ H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM) Machine Learning for NLP 30/06/2003
  • 26. Classification The Classification Setting Learning, Error, ... • Expected error (risk): R(h) = ∫ loss(h(x), y) dP(x,y) • Problem: P itself is unknown. Known are only the training examples ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h ∈ H for which the training error (empirical risk) is minimal: R_emp(h) = (1/m) Σ_{i=1..m} loss(h(x_i), y_i) Machine Learning for NLP 30/06/2003
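    As an illustration of ERM (a sketch, not from the slides; names are illustrative), the empirical risk under 0/1 loss is just the fraction of misclassified training examples:

        def empirical_risk(h, examples):
            """R_emp(h) = (1/m) * sum of 0/1 losses over (x_i, y_i) pairs."""
            m = len(examples)
            return sum(1 for x, y in examples if h(x) != y) / m

        # ERM then picks, from a (finite) hypothesis space H, the minimiser:
        # h_star = min(H, key=lambda h: empirical_risk(h, train_set))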
  • 27. Classification The Classification Setting Error, Over(under)fitting, ... • Low training error ⇒ low true error? • The overfitting dilemma: underfitting vs. overfitting • Trade-off between training error and complexity • Different learning biases can be used Machine Learning for NLP 30/06/2003
  • 28. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 29. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms −Decision Trees −AdaBoost −Support Vector Machines • Applications to NLP Machine Learning for NLP 30/06/2003
  • 30. Algorithms Learning Paradigms • Statistical learning: – HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) – Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) – Winnow, AdaBoost, SVM’s, etc. Machine Learning for NLP 30/06/2003
  • 31. Algorithms Learning Paradigms • Classifier combination: – Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples – Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc. Machine Learning for NLP 30/06/2003
  • 32. Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes Machine Learning for NLP 30/06/2003
  • 33. Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc. Machine Learning for NLP 30/06/2003
  • 34. Algorithms An Example (two decision-tree figures: a generic n-ary tree branching on attributes A1, A2, A3, A5 over values v1…v7 with leaf classes C1-C3, and the concrete SIZE/SHAPE/COLOR tree from slide 20) Machine Learning for NLP 30/06/2003
  • 35. Algorithms Learning Decision Trees (figure): Training Set + TDIDT → DT (learning); Test Example + DT → Class (classification) Machine Learning for NLP 30/06/2003
  • 36. Algorithms General Induction Algorithm
        function TDIDT (X: set-of-examples; A: set-of-features)
          var tree1, tree2: decision-tree; X’: set-of-examples; A’: set-of-features
          if stopping_criterion(X) then
            tree1 := create_leaf_tree(X)
          else
            amax := feature_selection(X, A)
            tree1 := create_tree(X, amax)
            for-all val in values(amax) do
              X’ := select_examples(X, amax, val)
              A’ := A − {amax}
              tree2 := TDIDT(X’, A’)
              tree1 := add_branch(tree1, tree2, val)
            end-for
          end-if
          return tree1
        end-function
    Machine Learning for NLP 30/06/2003
  • 37. Algorithms General Induction Algorithm (the same TDIDT pseudocode, shown again) Machine Learning for NLP 30/06/2003
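    For readers who want something executable, here is a minimal Python rendering of the TDIDT pseudocode above; stopping_criterion and feature_selection are kept abstract, exactly as on the slide:

        from collections import Counter

        def tdidt(X, A, feature_selection, stopping_criterion):
            """X: list of (features_dict, label); A: set of feature names."""
            if stopping_criterion(X) or not A:
                majority = Counter(y for _, y in X).most_common(1)[0][0]
                return ("leaf", majority)                 # create_leaf_tree
            amax = feature_selection(X, A)                # best feature
            branches = {}
            for val in {x[amax] for x, _ in X}:           # values(amax)
                X_val = [(x, y) for x, y in X if x[amax] == val]
                branches[val] = tdidt(X_val, A - {amax},
                                      feature_selection, stopping_criterion)
            return ("node", amax, branches)               # create_tree + add_branch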
  • 38. Algorithms Feature Selection Criteria • Functions derived from Information Theory: – Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures – Gini Diversity Index (Breiman et al. 84) – RLM (López de Mántaras 91) • Statistically-based – Chi-square test (Sestito & Dillon 94) – Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94) Machine Learning for NLP 30/06/2003
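    As one concrete instance of the criteria listed above, a sketch of Information Gain (Quinlan 86), using the same (features_dict, label) representation as the TDIDT sketch; all names are illustrative:

        import math
        from collections import Counter

        def entropy(labels):
            """Shannon entropy of a non-empty list of class labels."""
            counts, total = Counter(labels), len(labels)
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

        def information_gain(X, a):
            """Entropy reduction obtained by splitting X on feature a."""
            labels = [y for _, y in X]
            split = Counter(x[a] for x, _ in X)
            remainder = sum(
                (n / len(X)) * entropy([y for x, y in X if x[a] == v])
                for v, n in split.items())
            return entropy(labels) - remainder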
  • 39. Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimize the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc. Machine Learning for NLP 30/06/2003
  • 40. Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98) Machine Learning for NLP 30/06/2003
  • 41. Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Benett 95; Mc Carthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97) Machine Learning for NLP 30/06/2003
  • 42. Algorithms Decision Trees: pros&cons • Advantages – Acquires symbolic knowledge in an understandable way – Very well-studied ML algorithm with many variants – Can be easily translated into rules – Existence of available software: C4.5, C5.0, etc. – Can be easily integrated into an ensemble Machine Learning for NLP 30/06/2003
  • 43. Algorithms Decision Trees: pros&cons • Drawbacks – Computationally expensive when scaling to large natural language domains: training examples, features, etc. – Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation – DTs are high-variance (unstable) models – Tendency to overfit training data: pruning is necessary – Requires quite a big effort in tuning the model Machine Learning for NLP 30/06/2003
  • 44. Algorithms Boosting algorithms • Idea: “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively • Many other variants and extensions (1997-2003) http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html Machine Learning for NLP 30/06/2003
  • 45. Algorithms AdaBoost: general scheme (figure): the final classifier is a linear combination F(h1, h2, …, hT) of weak hypotheses; at each round t a Weak Learner is trained on training set TSt under probability distribution Dt, producing ht, and the distribution is then updated for the next round Machine Learning for NLP 30/06/2003
  • 46. Algorithms AdaBoost: algorithm (Freund & Schapire 97) (the algorithm box appears as an image on the slide) Machine Learning for NLP 30/06/2003
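    Since the algorithm box on this slide is an image, here is a minimal sketch of standard binary AdaBoost (Freund & Schapire 97), with labels in {-1, +1} and an abstract weak_learner(examples, D) that returns a hypothesis h; names are illustrative:

        import math

        def adaboost(examples, weak_learner, T):
            """examples: list of (x, y) with y in {-1, +1}."""
            m = len(examples)
            D = [1.0 / m] * m                      # initial distribution D_1
            ensemble = []                          # pairs (alpha_t, h_t)
            for _ in range(T):
                h = weak_learner(examples, D)
                eps = sum(D[i] for i, (x, y) in enumerate(examples) if h(x) != y)
                if eps >= 0.5:                     # no better than chance: stop
                    break
                eps = max(eps, 1e-10)              # avoid log(0) for a perfect h
                alpha = 0.5 * math.log((1 - eps) / eps)
                # Re-weight: misclassified examples gain weight, correct ones lose it
                D = [D[i] * math.exp(-alpha * y * h(x))
                     for i, (x, y) in enumerate(examples)]
                Z = sum(D)                         # normalisation constant
                D = [d / Z for d in D]
                ensemble.append((alpha, h))
            # Combined hypothesis: sign of the weighted vote
            return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1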
  • 47. Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes Machine Learning for NLP 30/06/2003
  • 48. Algorithms AdaBoost: round 1 Machine Learning for NLP 30/06/2003
  • 49. Algorithms AdaBoost: round 2 Machine Learning for NLP 30/06/2003
  • 50. Algorithms AdaBoost: round 3 Machine Learning for NLP 30/06/2003
  • 51. Algorithms Combined Hypothesis www.research.att.com/~yoav/adaboost Machine Learning for NLP 30/06/2003
  • 52. Algorithms AdaBoost and NLP • POS Tagging (Abney et al. 99; Màrquez 99) • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99) • PP-attachment Disambiguation (Abney et al. 99) • Parsing (Haruno et al. 99) • Word Sense Disambiguation (Escudero et al. 00, 01) • Shallow parsing (Carreras & Màrquez, 01a; 02) • Email spam filtering (Carreras & Màrquez, 01b) • Term Extraction (Vivaldi, et al. 01) Machine Learning for NLP 30/06/2003
  • 53. Algorithms AdaBoost: pros&cons + Easy to implement and few parameters to set + Time and space grow linearly with the number of examples. Ability to manage very large learning problems + Does not constrain explicitly the complexity of the learner + Naturally combines feature selection with learning + Has been successfully applied to many practical problems Machine Learning for NLP 30/06/2003
  • 54. Algorithms AdaBoost: pros&cons ± Seems to be rather robust to overfitting (number of rounds) but sensitive to noise ± Performance is very good when there are relatively few relevant terms (features) – Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers: the training errors of the base classifiers become too large too quickly Machine Learning for NLP 30/06/2003
  • 55. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Machine Learning for NLP 30/06/2003
  • 56. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Key Concepts Machine Learning for NLP 30/06/2003
  • 57. Algorithms Linear Classifiers • Hyperplanes in R^N • Defined by a weight vector (w) and a threshold (b) • They induce the classification rule: h(x) = sign(Σ_{i=1..N} w_i x_i + b), i.e. +1 if Σ_{i=1..N} w_i x_i + b ≥ 0 and −1 otherwise (figure: positive and negative points on either side of the hyperplane defined by w and b) Machine Learning for NLP 30/06/2003
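    The rule above, as a two-line sketch (illustrative weights):

        def linear_classifier(w, b):
            # h(x) = sign(w . x + b), with sign(0) taken as +1
            return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

        h = linear_classifier([2.0, -1.0], 0.5)
        print(h([1.0, 1.0]))   # -> 1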
  • 58. Algorithms Optimal Hyperplane: Geometric Intuition Machine Learning for NLP 30/06/2003
  • 59. Algorithms Optimal Hyperplane: Geometric Intuition These are the Support Vectors Maximal Margin Hyperplane Machine Learning for NLP 30/06/2003
  • 60. Algorithms Linearly separable data: the geometric margin is 2/||w||, so maximizing the margin is equivalent to minimizing ||w||² subject to the constraints y_i (w·x_i + b) ≥ 1 for all i = 1, …, l (a Quadratic Programming problem) Machine Learning for NLP 30/06/2003
  • 61. Algorithms Non-separable case (soft margin): introduce positive slack variables ξ_1, …, ξ_l for costing the errors. Minimize ||w||² + C Σ_{i=1..l} ξ_i subject to the constraints y_i (w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, …, l Machine Learning for NLP 30/06/2003
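    The slide poses the soft-margin problem as a QP. As an illustrative alternative (not the method on the slide), the equivalent regularised hinge loss can be approximately minimised by stochastic subgradient descent, Pegasos-style; the sketch below assumes examples as (feature_list, label) pairs with labels in {-1, +1}:

        import random

        def svm_sgd(examples, lam=0.01, epochs=100):
            """Minimise lam/2 * ||w||^2 + mean hinge loss by SGD."""
            data = list(examples)              # avoid mutating the caller's list
            n = len(data[0][0])
            w, b, t = [0.0] * n, 0.0, 0
            for _ in range(epochs):
                random.shuffle(data)
                for x, y in data:
                    t += 1
                    eta = 1.0 / (lam * t)      # decreasing learning rate
                    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
                    w = [wi * (1 - eta * lam) for wi in w]   # regulariser shrink
                    if margin < 1:             # hinge loss active: step towards x
                        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                        b += eta * y
            return w, b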
  • 62. Algorithms Non-linear SVMs • Implicit mapping into feature space via kernel functions: a non-linear mapping φ : X → F • Set of hypotheses: f(x) = Σ_{i=1..n} w_i φ_i(x) + b • Dual formulation: f(x) = Σ_{i=1..l} α_i y_i φ(x_i)·φ(x) + b • Kernel function: K(x, z) = φ(x)·φ(z) • Evaluation: f(x) = Σ_{i=1..l} α_i y_i K(x_i, x) + b Machine Learning for NLP 30/06/2003
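    Once the coefficients α_i and the bias b have been obtained (training itself is not shown here), classification needs only kernel evaluations against the support vectors. A sketch with a degree-3 polynomial kernel, matching slide 64; names are illustrative:

        def poly_kernel(x, z, d=3):
            """K(x, z) = (1 + x . z)^d"""
            return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** d

        def svm_decision(alphas, sv_xs, sv_ys, b, kernel):
            """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, thresholded at 0."""
            def f(x):
                s = sum(a * y * kernel(xi, x)
                        for a, xi, y in zip(alphas, sv_xs, sv_ys)) + b
                return 1 if s >= 0 else -1
            return f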
  • 63. Algorithms Non-linear SVMs • Kernel functions – Must be efficiently computable – Characterization via Mercer’s theorem – One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000) – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc. Machine Learning for NLP 30/06/2003
  • 64. Algorithms Non-linear SVMs. Degree 3 polynomial kernel (figures): linearly separable vs. linearly non-separable data Machine Learning for NLP 30/06/2003
  • 65. Algorithms Toy Examples • All examples have been run with the 2D graphic interface of SVMLIB (Chang and Lin, National University of Taiwan) “LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…” • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web integrated demo tool) Machine Learning for NLP 30/06/2003
  • 66. Algorithms Toy Examples (I) Linearly separable data set. Linear SVM. Maximal margin hyperplane. What happens if we add a blue training example here? Machine Learning for NLP 30/06/2003
  • 67. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM High value of C parameter Maximal margin Hyperplane The example is correctly classified Machine Learning for NLP 30/06/2003
  • 68. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM Low value of C parameter Trade-off between: margin and training error The example is now a bounded SV Machine Learning for NLP 30/06/2003
  • 69. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 70. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 71. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 72. Algorithms Toy Examples (III) Machine Learning for NLP 30/06/2003
  • 73. Algorithms SVM: Summary • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+) • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+) • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+) Machine Learning for NLP 30/06/2003
  • 74. Algorithms SVM: Summary • Due to Mercer’s conditions on the kernels the optimisation problems are convex. No local minima (+) • Optimisation theory guides the implementation. Efficient learning (+) • Mainly for classification but also for regression, density estimation, clustering, etc. • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+) • Parameter tuning (–). Implications in convergence times, sparsity of the solution, etc. Machine Learning for NLP 30/06/2003
  • 75. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 76. Applications NLP problems • Warning! We will not focus on final NLP applications, but on intermediate tasks... • We will classify the NLP tasks according to their (structural) complexity Machine Learning for NLP 30/06/2003
  • 77. Applications NLP problems: structural complexity • Decisional problems − Text Categorization, Document filtering, Word Sense Disambiguation, etc. • Sequence tagging and detection of sequential structures − POS tagging, Named Entity extraction, syntactic chunking, etc. • Hierarchical structures − Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc. Machine Learning for NLP 30/06/2003
  • 78. Applications POS tagging • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 79. Applications POS tagging: the “preposition-adverb” tree (figure): a decision tree for disambiguating IN vs. RB; root probabilities P(IN)=0.81, P(RB)=0.19; branching on Word Form (“As”/“as” vs. others: P(IN)=0.83, P(RB)=0.17), then tag(+1) (RB vs. others: P(IN)=0.13, P(RB)=0.87), then tag(+2) (IN), reaching a leaf with P(IN)=0.013, P(RB)=0.987. Probabilistic interpretation: estimated P(RB | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987 and estimated P(IN | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013 Machine Learning for NLP 30/06/2003
  • 80. Applications POS tagging: the same “preposition-adverb” tree (figure), now highlighting the collocations it captures: “as_RB much_RB as_IN”, “as_RB soon_RB as_IN”, “as_RB well_RB as_IN” Machine Learning for NLP 30/06/2003
  • 81. Applications POS tagging RTT (Màrquez & Rodríguez 97) (figure): raw text → morphological analysis → disambiguation loop [Classify → Update → Filter], repeated until the stop condition holds → tagged text, driven by a Language Model. Related: A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01) Machine Learning for NLP 30/06/2003
  • 82. Applications POS tagging STT (Màrquez & Rodríguez 97) (figure): the Language Model combines lexical probabilities and contextual probabilities; raw text → morphological analysis → Viterbi algorithm (disambiguation) → tagged text. Related: The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00) Machine Learning for NLP 30/06/2003
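    A minimal sketch of the Viterbi step in the STT scheme above; the lexical probabilities P(word|tag), contextual bigram probabilities P(tag|previous tag) and initial tag probabilities are assumed to be given as dictionaries (all names illustrative):

        def viterbi(words, tags, lex_prob, ctx_prob, init_prob):
            """lex_prob[t][w] = P(w|t); ctx_prob[p][t] = P(t|previous tag p);
            init_prob[t] = P(t at sentence start)."""
            # delta[t]: probability of the best tag sequence ending in tag t
            delta = {t: init_prob[t] * lex_prob[t].get(words[0], 1e-9) for t in tags}
            backptrs = []
            for w in words[1:]:
                new_delta, ptr = {}, {}
                for t in tags:
                    prev, score = max(((p, delta[p] * ctx_prob[p][t]) for p in tags),
                                      key=lambda pair: pair[1])
                    new_delta[t], ptr[t] = score * lex_prob[t].get(w, 1e-9), prev
                delta = new_delta
                backptrs.append(ptr)
            # Follow the back-pointers from the best final tag
            seq = [max(delta, key=delta.get)]
            for ptr in reversed(backptrs):
                seq.append(ptr[seq[-1]])
            return list(reversed(seq))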
  • 83. Applications Detection of sequential and hierarchical structures • Named Entity recognition • Clause detection Machine Learning for NLP 30/06/2003
  • 84. Conclusions Summary/conclusions • We have briefly outlined: − The ML setting: “supervised learning for classification” − Three concrete machine learning algorithms − How to apply them to solve intermediate NLP tasks Machine Learning for NLP 30/06/2003
  • 85. Conclusions Summary/conclusions • Any ML algorithm for NLP should be: – Robust to noise and outliers – Efficient in large feature/example spaces – Adaptive to new/changing domains: portability, tuning, etc. – Able to take advantage of unlabelled examples: semi-supervised learning Machine Learning for NLP 30/06/2003
  • 86. Conclusions Summary/conclusions • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research Machine Learning for NLP 30/06/2003
  • 87. Conclusions Some current research lines • Appropriate learning paradigms for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02). • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc. • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction • etc. Machine Learning for NLP 30/06/2003
  • 88. Conclusions Bibliography • You may find additional information at: http://www.lsi.upc.es/~lluism/ tesi.html publicacions/pubs.html cursos/talks.html cursos/MLandNL.html cursos/emnlp1.html • This talk at: http://www.lsi.upc.es/~lluism/udg03.ppt.gz Machine Learning for NLP 30/06/2003
  • 89. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003