Seminar: Statistical NLP

   Machine Learning for
Natural Language Processing
              Lluís Màrquez
            TALP Research Center
      Llenguatges i Sistemes Informàtics
      Universitat Politècnica de Catalunya

              Girona, June 2003

               Machine Learning for NLP      30/06/2003
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

               Machine Learning
   • There are many general-purpose definitions of Machine
     Learning (or artificial learning):

         Making a computer automatically acquire some
         kind of knowledge from a concrete data domain

   • Learners are computers: we study learning algorithms
   • Resources are scarce: time, memory, data, etc.
   • It has (almost) nothing to do with cognitive science,
     neuroscience, the theory of scientific discovery and research, etc.
   • Biological plausibility is welcome but not the main goal

                Machine Learning
   • Learning... but what for?
         – To perform some particular task
         – To react to environmental inputs
         – Concept learning from data:
            • modelling concepts underlying data
            • predicting unseen observations
            • compacting the knowledge representation
            • knowledge discovery for expert systems

   • We will concentrate on:
         – Supervised inductive learning for classification
           = discriminative learning

             Machine Learning

     A more precise definition:

       Obtaining a description of the concept in
     some representation language that explains
        observations and helps predict new
          instances of the same distribution

     • What to read?
         – Machine Learning (Mitchell, 1997)

                       Empirical NLP
    90’s: Application of Machine Learning techniques
          (ML) to NLP problems

    • Lexical and structural ambiguity problems
         –   Word selection (SR, MT)
         –   Part-of-speech tagging
         –   Semantic ambiguity (polysemy)
         –   Prepositional phrase attachment
         –   Reference ambiguity (anaphora)
         –   etc.

    • What to read? Foundations of Statistical Natural Language
         Processing   (Manning & Schütze, 1999)


     NLP “classification” problems
     • Ambiguity is a crucial problem for natural
       language understanding/processing.
       Ambiguity Resolution = Classification

         He was shot in the hand as he chased
            the robbers in the back street

                               (The Wall Street Journal Corpus)


     NLP “classification” problems

     • Morpho-syntactic ambiguity:
       Part of Speech Tagging

         He was shot in the hand as he chased
            the robbers in the back street

         [Slide annotation: candidate tags NN, VB and JJ are
          overlaid on the ambiguous words]

                               (The Wall Street Journal Corpus)

     NLP “classification” problems

     • Semantic (lexical) ambiguity:
       Word Sense Disambiguation

         He was shot in the hand as he chased
            the robbers in the back street

         [Slide annotation: the sense label “body-part” is
          overlaid on the sentence]

                               (The Wall Street Journal Corpus)

     NLP “classification” problems

     • Structural (syntactic) ambiguity:
       PP-attachment disambiguation

         He was shot in the hand as he (chased
         (the robbers)NP (in the back street)PP)

                               (The Wall Street Journal Corpus)

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP


         Feature Vector Classification
      • An instance is a vector: x=<x1,…, xn> whose components,
        called features (or attributes), are discrete or real-valued.

      • Let X be the space of all possible instances.

      • Let Y={y1,…, ym} be the set of categories (or classes).

      • The goal is to learn an unknown target function, f : X → Y
      • A training example is an instance x belonging to X, labelled
         with the correct value for f(x), i.e., a pair <x, f(x)>

      • Let D be the set of all training examples.


         Feature Vector Classification

    • The hypothesis space, H, is the set of functions h: X → Y
      that the learner can consider as possible definitions

          The goal is to find a function h belonging to H
          such that for every pair <x, f(x)> belonging to D,
                             h(x) = f(x)

                           An Example
                 Example    SIZE      COLOR      SHAPE       CLASS
                    1       small      red       circle      positive
                    2       big        red       circle      positive
                    3       small      red       triangle    negative
                    4       big        blue      circle      negative

             Rules
                 (COLOR=red) ∧ (SHAPE=circle) → positive
                 otherwise                    → negative

             Decision Tree
                 COLOR
                 ├─ red:  SHAPE
                 │        ├─ circle:   positive
                 │        └─ triangle: negative
                 └─ blue: negative

                           An Example
                 Example    SIZE      COLOR      SHAPE       CLASS
                    1       small      red       circle      positive
                    2       big        red       circle      positive
                    3       small      red       triangle    negative
                    4       big        blue      circle      negative

             Rules
                 (SIZE=small) ∧ (SHAPE=circle) → positive
                 (SIZE=big) ∧ (COLOR=red)      → positive
                 otherwise                     → negative

             Decision Tree
                 SIZE
                 ├─ small: SHAPE
                 │         ├─ circle:   pos
                 │         └─ triangle: neg
                 └─ big:   COLOR
                           ├─ red:  pos
                           └─ blue: neg
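
The two example slides describe different hypotheses that both fit the same four training examples. The sketch below encodes them as plain Python functions and checks consistency with D. The function names, the tuple encoding of instances, and the unseen test instance are illustrative assumptions; the first hypothesis follows the tree whose branches are red/blue, i.e. a test on COLOR at the root.

```python
# Toy training set D from the slide: (SIZE, COLOR, SHAPE) -> CLASS
D = [
    (("small", "red", "circle"), "positive"),
    (("big", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("big", "blue", "circle"), "negative"),
]

def h_color_first(x):
    """First hypothesis: COLOR at the root, then SHAPE."""
    size, color, shape = x
    return "positive" if color == "red" and shape == "circle" else "negative"

def h_size_first(x):
    """Second hypothesis: SIZE at the root, then SHAPE or COLOR."""
    size, color, shape = x
    if size == "small":
        return "positive" if shape == "circle" else "negative"
    return "positive" if color == "red" else "negative"

def is_consistent(h, examples):
    """True if h(x) == f(x) for every labelled pair <x, f(x)>."""
    return all(h(x) == y for x, y in examples)

# Both hypotheses agree with every training example...
print(is_consistent(h_color_first, D), is_consistent(h_size_first, D))

# ...but they can disagree on an instance that is not in D
# (a hypothetical unseen instance chosen for illustration).
unseen = ("small", "blue", "circle")
print(h_color_first(unseen), h_size_first(unseen))
```

Since the training data alone cannot decide between the two, the learner needs some additional preference to pick one of them.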


         Some important concepts
     • Inductive Bias
        “Any means that a classification learning system uses to choose
        between two functions that are both consistent with the training
        data is called inductive bias” (Mooney & Cardie, 99)
         – Language / Search bias

                                 Decision Tree

                                 COLOR
                                 ├─ red:  SHAPE
                                 │        ├─ circle:   positive
                                 │        └─ triangle: negative
                                 └─ blue: negative


         Some important concepts
     • Inductive Bias

     • Training error and generalization error

     • Generalization ability and overfitting

     • Batch Learning vs. on-line Learning

     • Symbolic vs. statistical Learning

     • Propositional vs. first-order learning


                     Propositional vs.
                    Relational Learning
    • Propositional learning

                 color(red) ∧ shape(circle) ⇒ classA

    • Relational learning = ILP (induction of logic programs)

        course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y)

        research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧
        link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)

            The Classification Setting
             Class, Point, Example, Data Set, ...
         • Input Space: X ⊆ R^n

         • (binary) Output Space: Y = {+1, -1}
         • A point, pattern or instance:
                    x ∈ X, x = (x1, x2, …, xn)
         • Example: (x, y) with x ∈ X, y ∈ Y
         • Training Set: a set of m examples generated i.i.d.
           according to an unknown distribution P(x,y)
                 S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m

            The Classification Setting
                     Learning, Error, ...
         • The hypothesis space, H, is the set of functions
           h: X → Y that the learner can consider as possible
           definitions. In SVMs they are of the form:

                       $h(x) = \sum_{i=1}^{N} w_i \phi_i(x) + b$

         • The goal is to find a function h belonging to H such
           that the expected misclassification error on new
           examples, also drawn from P(x,y), is minimal
           (Risk Minimization, RM)

            The Classification Setting
                     Learning, Error, ...
        • Expected error (risk)

                   $R(h) = \int \mathrm{loss}(h(x), y) \, dP(x, y)$

        • Problem: P itself is unknown; only training examples
          are known ⇒ an induction principle is needed
        • Empirical Risk Minimization (ERM): Find the
          function h belonging to H for which the training
          error (empirical risk) is minimal

                   $R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{loss}(h(x_i), y_i)$
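
As a small worked illustration of the empirical risk, the following sketch averages a 0-1 loss over a training sample. The threshold classifier `h`, the sample `S`, and the choice of 0-1 loss are assumptions made for the example; the slide leaves the loss function unspecified.

```python
def zero_one_loss(prediction, label):
    """Misclassification loss: 1 for an error, 0 otherwise."""
    return 0 if prediction == label else 1

def empirical_risk(h, sample, loss=zero_one_loss):
    """Average loss of hypothesis h over the m examples in the sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def h(x):
    """Hypothetical hypothesis: threshold a 1-D input at zero."""
    return +1 if x >= 0.0 else -1

# Hypothetical training sample S = {(x_i, y_i)} with labels in {+1, -1}
S = [(-2.0, -1), (-0.5, -1), (0.3, +1), (1.5, +1), (-0.1, +1)]

# Only the last example is misclassified, so the training error is 1/5.
print(empirical_risk(h, S))
```

ERM would compare this quantity across all hypotheses in H and keep the one with the smallest value.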

            The Classification Setting
                 Error, Over(under)fitting,...
        • Low training error ⇒ low true error?
        • The overfitting dilemma:

                  [Figure: two fits of the same data, labelled
                   “Underfitting” and “Overfitting”]

        • Trade-off between training error and complexity
        • Different learning biases can be used
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
  −Decision Trees
  −AdaBoost
  −Support Vector Machines

• Applications to NLP
                 Learning Paradigms

       • Statistical learning:
             – HMM, Bayesian Networks, ME, CRF, etc.
       • Traditional methods from Artificial Intelligence
         (ML, AI)
             – Decision trees/lists, exemplar-based learning, rule
               induction, neural networks, etc.

       • Methods from Computational Learning
         Theory (CoLT/SLT)
             – Winnow, AdaBoost, SVM’s, etc.

              Learning Paradigms

       • Classifier combination:
             – Bagging, Boosting, Randomization, ECOC,
               Stacking, etc.

       • Semi-supervised learning: learning from
         labelled and unlabelled examples
             – Bootstrapping, EM, Transductive learning
               (SVM’s, AdaBoost), Co-Training, etc.

       • etc.

                  Decision Trees
  • Decision trees are a way to represent rules underlying
    training data, with hierarchical structures that
    recursively partition the data.

  • They have been used by many research communities
    (Pattern Recognition, Statistics, ML, etc.) for data
    exploration with some of the following purposes:
    Description, Classification, and Generalization.

  • From a machine-learning perspective: Decision Trees
    are n-ary branching trees that represent classification
    rules for classifying the objects of a certain domain into
    a set of mutually exclusive classes

                    Decision Trees
     • Acquisition:
       Top-Down Induction of Decision Trees
     • Systems:
             CART (Breiman et al. 84),
             ID3, C4.5, C5.0 (Quinlan 86,93,98),
             ASSISTANT, ASSISTANT-R (Cestnik et al. 87)
             (Kononenko et al. 95)

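                                An Example

           [Figure: a generic decision tree. Internal nodes test
            features (A2, A3, A5, ...), branches are labelled with
            feature values (v1, ..., v7), and leaves are classes
            (C1, C2, C3).]

           Decision Tree

           SIZE
           ├─ small: SHAPE
           │         ├─ circle: pos
           │         └─ triang: neg
           └─ big:   COLOR
                     ├─ red:  pos
                     └─ blue: neg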

         Learning Decision Trees

               [training set] + TDIDT = [decision tree]

               Example + [decision tree] = Class

         General Induction Algorithm
         function TDIDT (X: set-of-examples; A: set-of-features)
           var: tree1, tree2: decision-tree;
                X’: set-of-examples;
                A’: set-of-features
           if (stopping_criterion (X)) then
             tree1 := create_leaf_tree (X)
           else
             amax := feature_selection (X, A);
             tree1 := create_tree (X, amax);
             for-all val in values (amax) do
               X’ := select_examples (X, amax, val);
               A’ := A - {amax};
               tree2 := TDIDT (X’, A’);
               tree1 := add_branch (tree1, tree2, val)
             end-for
           end-if
           return (tree1)
         end-function
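
The induction scheme can be sketched in runnable form as follows. This is a minimal illustration, not the implementation of any of the cited systems: it assumes discrete features, uses information gain as `feature_selection`, stops on pure nodes or when no features remain, and represents trees as nested dictionaries. All function and key names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Class entropy of a list of (instance, label) pairs."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting on one feature."""
    n = len(examples)
    remainder = 0.0
    for val in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == val]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def tdidt(examples, features):
    labels = [y for _, y in examples]
    # Stopping criterion (assumed): pure node or no features left.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    # Feature selection; ties are broken by alphabetical feature order.
    amax = max(sorted(features), key=lambda a: information_gain(examples, a))
    tree = {"feature": amax, "branches": {}}
    for val in sorted({x[amax] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[amax] == val]
        tree["branches"][val] = tdidt(subset, features - {amax})
    return tree

def classify(tree, x):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree

D = [
    ({"SIZE": "small", "COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"SIZE": "big", "COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"SIZE": "small", "COLOR": "red", "SHAPE": "triangle"}, "negative"),
    ({"SIZE": "big", "COLOR": "blue", "SHAPE": "circle"}, "negative"),
]
tree = tdidt(D, {"SIZE", "COLOR", "SHAPE"})
print(tree)
```

On the four toy examples COLOR and SHAPE tie for the best gain at the root; with the alphabetical tie-break the sketch tests COLOR first and then SHAPE on the red branch. A feature value that never occurs in training would raise a `KeyError` in `classify`, which a fuller implementation would need to handle.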


             Feature Selection Criteria
     • Functions derived from Information Theory:
        – Information Gain, Gain Ratio (Quinlan 86)

     • Functions derived from Distance Measures
        – Gini Diversity Index (Breiman et al. 84)
        – RLM (López de Mántaras 91)

     • Statistically-based
        – Chi-square test (Sestito & Dillon 94)
        – Symmetrical Tau (Zhou & Dillon 91)

     • RELIEFF-IG: variant of RELIEFF (Kononenko 94)
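
To make one of these criteria concrete, the sketch below scores each feature of the four-example toy data by the reduction in Gini impurity, where impurity is 1 minus the sum of squared class proportions. The data encoding and function names are assumptions for illustration; CART and the other cited systems differ in many details, such as binary splits and handling of continuous features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(rows, labels, j):
    """Impurity before the split minus weighted impurity after it."""
    n = len(rows)
    after = 0.0
    for val in {r[j] for r in rows}:
        part = [y for r, y in zip(rows, labels) if r[j] == val]
        after += len(part) / n * gini(part)
    return gini(labels) - after

# Toy data: columns are (SIZE, COLOR, SHAPE)
rows = [
    ("small", "red", "circle"),
    ("big", "red", "circle"),
    ("small", "red", "triangle"),
    ("big", "blue", "circle"),
]
labels = ["positive", "positive", "negative", "negative"]

scores = {
    name: gini_reduction(rows, labels, j)
    for j, name in enumerate(("SIZE", "COLOR", "SHAPE"))
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

SIZE leaves both partitions as mixed as the original set, so its score is zero, while COLOR and SHAPE each isolate one negative example and score equally.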

                Extensions of DTs
                                      (Murthy 95)

      • Pruning (pre/post)
      • Minimize the effect of the greedy approach:
      • Non-linear splits
      • Combination of multiple models
      • Incremental learning (on-line)
      • etc.

             Decision Trees and NLP
    • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
    • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez &
      Rodríguez 95,97; Màrquez et al. 00)

    • Word sense disambiguation (Brown et al. 91; Cardie 93;
      Mooney 96)

    • Parsing (Magerman 95,96; Haruno et al. 98,99)
    • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
    • Text summarization (Mani & Bloedorn 98)
    • Dialogue act tagging (Samuel et al. 98)

             Decision Trees and NLP
     • Noun phrase coreference
        (Aone & Bennett 95; McCarthy & Lehnert 95)

     • Discourse analysis in information extraction
        (Soderland & Lehnert 94)

     • Cue phrase identification in text and speech
        (Litman 94; Siegel & McKeown 94)

     • Verb classification in Machine Translation
        (Tanaka 96; Siegel 97)


        Decision Trees: pros&cons

    • Advantages
       – Acquires symbolic knowledge in an
         understandable way
       – Very well studied ML algorithms and variants
       – Can be easily translated into rules
       – Existence of available software: C4.5, C5.0, etc.
       – Can be easily integrated into an ensemble


        Decision Trees: pros&cons
  • Drawbacks
     – Computationally expensive when scaling to large
       natural language domains: training examples,
       features, etc.
     – Data sparseness and data fragmentation: the problem
       of the small disjuncts => Probability estimation
     – DTs are a model with high variance (unstable)
     – Tendency to overfit training data: pruning is necessary
     – Requires quite a big effort in tuning the model

             Boosting algorithms
   • Idea
     “to combine many simple and moderately accurate
      hypotheses (weak classifiers) into a single and highly
      accurate classifier”

   • AdaBoost (Freund & Schapire 95) has been
     theoretically and empirically studied extensively

   • Many other variants and extensions (1997-2003)

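             AdaBoost: general scheme

        [Figure: at each round t = 1, ..., T, a training sample TSt
         is obtained according to the probability distribution Dt and
         passed to the weak learner, which outputs the weak
         hypothesis ht.]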
             AdaBoost: algorithm

                           (Freund & Schapire 97)
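
As a rough illustration of the boosting loop, the sketch below implements binary AdaBoost in its commonly used α-weighted form, with axis-parallel threshold classifiers (decision stumps) as weak hypotheses. It is not necessarily the exact formulation shown on the slide: the reweighting variant, the stump learner, the clipping constant, and the one-dimensional toy data are all assumptions made for the example.

```python
from math import exp, log

def train_stump(X, y, D):
    """Weak learner: axis-parallel threshold with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (+1, -1):
                preds = [sign if x[j] >= thr else -sign for x in X]
                err = sum(d for d, p, t in zip(D, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] >= thr else -sign

def adaboost(X, y, rounds):
    m = len(X)
    D = [1.0 / m] * m                      # D1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, D)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * log((1 - err) / err)           # weight of h_t
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, then normalise.
        D = [d * exp(-alpha * t * stump_predict(stump, x))
             for d, x, t in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Combined hypothesis: sign of the weighted vote."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data that no single threshold can separate
X = [(v,) for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for T in (1, 3):
    ens = adaboost(X, y, rounds=T)
    correct = sum(predict(ens, x) == t for x, t in zip(X, y))
    print(f"T={T}: {correct}/{len(X)} training examples correct")
```

A single stump misclassifies three of the ten points, whereas the weighted vote of three stumps fits this toy sample exactly.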

             AdaBoost: example

    Weak hypotheses = vertical/horizontal hyperplanes

             [Figures: AdaBoost round 1, round 2, round 3, and the
              combined hypothesis]

                AdaBoost and NLP
      • POS Tagging (Abney et al. 99; Màrquez 99)
      • Text and Speech Categorization
         (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)

      • PP-attachment Disambiguation (Abney et al. 99)
      • Parsing (Haruno et al. 99)
      • Word Sense Disambiguation (Escudero et al. 00, 01)
      • Shallow parsing (Carreras & Màrquez, 01a; 02)
      • Email spam filtering (Carreras & Màrquez, 01b)
      • Term Extraction (Vivaldi, et al. 01)


             AdaBoost: pros&cons
       + Easy to implement and few parameters to set
       + Time and space grow linearly with number of
         examples. Ability to manage very large learning …
       + Does not constrain explicitly the complexity of the …
       + Naturally combines feature selection with learning
       + Has been successfully applied to many practical …


             AdaBoost: pros&cons

       ± Seems to be rather robust to overfitting
         (number of rounds) but sensitive to noise

       ± Performance is very good when there are
         relatively few relevant terms (features)

       – Can perform poorly when there is insufficient
         training data relative to the complexity of the
         base classifiers: the training errors of the base
         classifiers become too large too quickly


             SVM: A General Definition
      • “Support Vector Machines (SVM) are learning
        systems that use a hypothesis space of linear
        functions in a high dimensional feature space,
        trained with a learning algorithm from optimisation
        theory that implements a learning bias derived
        from statistical learning theory”.
        (Cristianini & Shawe-Taylor, 2000)

                Linear Classifiers
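    • Hyperplanes in R^N.
    • Defined by a weight vector (w) and a threshold (b).
    • They induce a classification rule:

        $h(x) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right) =
          \begin{cases} +1 & \text{if } \sum_{i=1}^{N} w_i x_i + b \geq 0 \\ -1 & \text{otherwise} \end{cases}$

        [Figure: positive (+) and negative (_) points separated by a
         hyperplane with normal vector w and offset b]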
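
The classification rule induced by a hyperplane can be sketched in a few lines. The weight vector, threshold, and test points below are hypothetical values chosen for illustration, and points lying exactly on the hyperplane are assigned to the positive class by assumption.

```python
def linear_classifier(w, b):
    """Return h(x) = sign(sum_i w_i * x_i + b) as a Python function."""
    def h(x):
        activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if activation >= 0 else -1
    return h

# Hypothetical hyperplane in R^2
h = linear_classifier(w=[1.0, -2.0], b=0.5)

print(h([3.0, 1.0]))  # activation = 3.0 - 2.0 + 0.5 = 1.5
print(h([0.0, 1.0]))  # activation = 0.0 - 2.0 + 0.5 = -1.5
```

Learning a linear classifier then amounts to choosing w and b from the training data.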

              Optimal Hyperplane:
              Geometric Intuition
   [Figure with callout: “These are the …”]

             Linearly separable data

       geometric margin = $2 / \|w\|_2$                    (Quadratic Programming)
       Maximizing the margin is equivalent to minimizing $\|w\|$ subject to the constraints:

                        $y_i (\langle w \cdot x_i \rangle + b) \geq 1$ for all $i = 1, \dots, l$

Seminari SVMs                   Machine Learning for NLP                     22/05/2001
       Non-separable case (soft margin)

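              $\xi_1, \dots, \xi_l$: positive slack variables for introducing costs
              Minimize $\|w\| + C \sum_{i=1}^{l} \xi_i$ subject to the constraints:

                   $y_i (\langle w \cdot x_i \rangle + b) \geq 1 - \xi_i$ for all $i = 1, \dots, l$
                   $\xi_i \geq 0$ for all $i = 1, \dots, l$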
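
To see what the margin and the slack variables measure, the sketch below evaluates them for a fixed hyperplane. It does not solve the optimisation problem: the values of w and b and the four points are hypothetical, and each slack is taken as the smallest value that satisfies the two constraints for that example.

```python
from math import sqrt

def output(w, b, x):
    """Functional output <w . x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w):
    """Width of the margin, 2 / ||w||_2."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (<w . x_i> + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * output(w, b, xi)) for xi, yi in zip(X, y)]

# Hypothetical hyperplane and data in R^2
w, b = [1.0, 1.0], -3.0
X = [[1.0, 1.0], [3.0, 2.0], [2.0, 1.5], [0.0, 0.0]]
y = [-1, +1, +1, -1]

print(geometric_margin(w))
print(slacks(w, b, X, y))
```

The third point is on the correct side of the hyperplane but inside the margin, so it is the only one that needs a non-zero slack. The parameter C sets how heavily such violations are penalised relative to the margin width.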

                      Non-linear SVMs
    • Implicit mapping into feature space via kernel functions

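             $\phi : X \to F$                                                                  Non-linear mapping
             $f(x) = \sum_{i=1}^{N} w_i \phi_i(x) + b$                                         Set of hypotheses
             $f(x) = \sum_{i=1}^{l} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle + b$   Dual formulation

             $K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$                                 Kernel function
             $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$                                Evaluation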
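
The identity between a kernel and an inner product in feature space can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel on R^2, for which an explicit feature map is known, and then evaluates the dual form of f. The coefficients `alphas`, the labels, the support points, and b are hypothetical values, not the result of training.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for K(x, z) = <x . z>^2 with x in R^2."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)

def K(x, z):
    """Kernel computed in input space, without building phi."""
    return dot(x, z) ** 2

def f(x, alphas, ys, points, b, kernel=K):
    """Dual evaluation: sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, ys, points)) + b

x, z = (1.0, 2.0), (3.0, -1.0)
print(K(x, z), dot(phi(x), phi(z)))   # equal up to rounding

# Hypothetical dual coefficients for two support vectors
alphas, ys, points, b = [0.5, 0.5], [+1, -1], [x, z], 0.0
print(f(x, alphas, ys, points, b))
```

The kernel needs only one inner product in the input space, whereas the explicit map builds a higher-dimensional vector first; for high-degree polynomial or Gaussian kernels the explicit map becomes very large or infinite-dimensional.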

                   Non-linear SVMs
    • Kernel functions
       – Must be efficiently computable

       – Characterization via Mercer’s theorem

       – One of the curious facts about using a kernel is
             that we do not need to know the underlying
             feature map in order to be able to learn in the
             feature space! (Cristianini & Shawe-Taylor, 2000)
       – Examples: polynomials, Gaussian radial basis
         functions, two-layer sigmoidal neural networks,


                Non-linear SVMs
                  Degree 3 polynomial kernel

             lin. separable                     lin. non-separable


                      Toy Examples
      • All examples have been run with the 2D graphic
        interface of LIBSVM (Chang and Lin, National Taiwan
         University)
         “LIBSVM is an integrated software for support vector classification,
         (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution
         estimation (one-class SVM). It supports multi-class classification. The
         basic algorithm is a simplification of both SMO by Platt and SVMLight
         by Joachims. It is also a simplification of the modification 2 of SMO by
         Keerthi et al. Our goal is to help users from other fields to easily use
         SVM as a tool. LIBSVM provides a simple interface where users can
         easily link it with their own programs…”

      • Available from:
        (it includes a Web integrated demo tool)

             Toy Examples (I)

                                    Linearly separable data set
                                    Linear SVM
                                    Maximal margin Hyperplane

                      What happens if we add
                      a blue training example

             Toy Examples (I)

                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    High value of C parameter
                                    Maximal margin Hyperplane

                     The example is
                    correctly classified

             Toy Examples (I)

                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    Low value of C parameter
                                    Trade-off between: margin
                                    and training error

                     The example is
                    now a bounded SV

             Toy Examples (II)

             Toy Examples (III)


                SVM: Summary
     • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik,
       1992). Great development since then

     • Kernel-induced feature spaces: SVMs work efficiently
       in very high dimensional feature spaces (+)

     • Learning bias: maximal margin optimisation.
       Reduces the danger of overfitting. Generalization
       bounds for SVMs (+)

     • Compact representation of the induced hypothesis.
       The solution is sparse in terms of SVs (+)


                 SVM: Summary
     • Due to Mercer’s conditions on the kernels the
       optimisation problems are convex. No local minima (+)
     • Optimisation theory guides the implementation.
       Efficient learning (+)
     • Mainly for classification but also for regression,
       density estimation, clustering, etc.
     • Success in many real-world applications: OCR, vision,
       bioinformatics, speech recognition, NLP: TextCat, POS
       tagging, chunking, parsing, etc. (+)
     • Parameter tuning (–). Implications in convergence
       times, sparsity of the solution, etc.

                         Machine Learning for NLP           30/06/2003
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

          Machine Learning for NLP   30/06/2003

                NLP problems

        • Warning! We will not focus on
          final NLP applications, but on
          intermediate tasks...

        • We will classify the NLP tasks
          according to their (structural)

                     Machine Learning for NLP   30/06/2003
          NLP problems: structural
       • Decisional problems
          − Text Categorization, Document filtering, Word
            Sense Disambiguation, etc.
       • Sequence tagging and detection of
         sequential structures
          − POS tagging, Named Entity extraction,
            syntactic chunking, etc.
       • Hierarchical structures
          − Clause detection, full parsing, IE of complex
            concepts, composite Named Entities, etc.
                         Machine Learning for NLP           30/06/2003

                  POS tagging
       • Morpho-syntactic ambiguity:
         Part of Speech Tagging

         He was shot in the hand as he chased
            the robbers in the back street
         [Figure: candidate POS tags displayed under the ambiguous words: JJ, NN, NN, VB, VB, VB]

                                 (The Wall Street Journal Corpus)

                        Machine Learning for NLP               30/06/2003

                          POS tagging
                            “preposition-adverb” tree
   [Figure: decision tree rooted at “Word Form”; visible annotations: P(IN)=0.83;
    P(IN)=0.13, P(RB)=0.87; a node labelled IN]
Probabilistic interpretation:
P( RB | word=“A/as”, tag(+1)=RB, tag(+2)=IN ) = 0.987
P( IN | word=“A/as”, tag(+1)=RB, tag(+2)=IN ) = 0.013

                                   Machine Learning for NLP                         30/06/2003

                     POS tagging
                      “preposition-adverb” tree
   [Figure: decision tree rooted at “Word Form”; visible annotations: P(IN)=0.83;
    branch “others”; P(IN)=0.13; a node labelled IN]
       Collocations:
       “as_RB much_RB as_IN”
       “as_RB soon_RB as_IN”
       “as_RB well_RB as_IN”

                          Machine Learning for NLP                         30/06/2003
                         POS tagging
                                       RTT (Màrquez & Rodríguez 97)
   [Diagram residue: text → Morphological ... → Classify → Update → Filter → (yes) → Tagged text]
   Overlaid reference: A Sequential Model for Multi-class Classification:
       NLP/POS Tagging (Even-Zohar & Roth, 01)

                             Machine Learning for NLP                   30/06/2003
                               POS tagging
                                         STT (Màrquez & Rodríguez 97)
   [Diagram residue: Raw text → Morphological analysis → Viterbi algorithm → Tagged text;
    a Language Model box (“probs. + Contextual probs.”) feeds the Viterbi algorithm]
   Overlaid reference: The Use of Classifiers in sequential inference:
       Chunking (Punyakanok & Roth, 00)

                                 Machine Learning for NLP               30/06/2003
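The Viterbi step of the diagram above can be sketched as follows. This is a generic first-order Viterbi decoder combining per-word tag probabilities with tag-bigram (contextual) probabilities, not the STT implementation; the function name, the `<s>` start symbol and every probability value are hypothetical.

```python
import math

def viterbi(words, tags, lexical, transition, start="<s>"):
    """Return the tag sequence maximising the product over the sentence of
    lexical[word][tag] * transition[(previous_tag, tag)]."""
    # best[t] = (log-probability, path) of the best sequence ending in tag t
    best = {}
    for t in tags:
        p = lexical[words[0]].get(t, 0.0) * transition.get((start, t), 0.0)
        if p > 0:
            best[t] = (math.log(p), [t])
    for word in words[1:]:
        new_best = {}
        for t in tags:
            lex = lexical[word].get(t, 0.0)
            if lex == 0.0:
                continue
            candidates = [
                (score + math.log(lex * transition[(prev, t)]), path + [t])
                for prev, (score, path) in best.items()
                if transition.get((prev, t), 0.0) > 0
            ]
            if candidates:
                new_best[t] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Hypothetical probabilities for the collocation "as soon as".
lexical = {"as": {"RB": 0.2, "IN": 0.8}, "soon": {"RB": 1.0}}
transition = {("<s>", "RB"): 0.3, ("<s>", "IN"): 0.7,
              ("RB", "RB"): 0.6, ("RB", "IN"): 0.4,
              ("IN", "RB"): 0.05, ("IN", "IN"): 0.95}
result = viterbi(["as", "soon", "as"], ["RB", "IN"], lexical, transition)
print(result)
```

With these made-up numbers a word-by-word choice would tag the first "as" as IN, whereas the sequence-level search prefers RB because of the following adverb, in line with the "as_RB soon_RB as_IN" collocation.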

       Detection of sequential and
         hierarchical structures

           • Named Entity recognition
           • Clause detection

                     Machine Learning for NLP   30/06/2003

      • We have briefly outlined:
         −The ML setting: “supervised learning for
         −Three concrete machine learning
         −How to apply them to solve intermediate
          NLP tasks

                     Machine Learning for NLP   30/06/2003

      • Any ML algorithm for NLP should be:
         – Robust to noise and outliers
         – Efficient in large feature/example spaces
         – Adaptive to new/changing domains:
           portability, tuning, etc.
         – Able to take advantage of unlabelled
           examples: semi-supervised learning

                      Machine Learning for NLP    30/06/2003

      • Statistical and ML-based Natural
        Language Processing is a very active
        and multidisciplinary area of research

                    Machine Learning for NLP   30/06/2003

      Some current research lines
     • Appropriate learning paradigm for all kinds of
       NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME
        (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02).

     • Definition of an adequate (and task-specific)
       feature space: mapping from the input space to a
        high dimensional feature space, kernels, etc.

     • Resolution of complex NLP problems:
        inference with classifiers + constraint satisfaction

     • etc.

                          Machine Learning for NLP           30/06/2003
     • You may find additional information at:

     • This talk at:

                       Machine Learning for NLP     30/06/2003
Machine Learning for NLP

  • 1. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003
  • 2. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 3. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 4. ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): Making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal Machine Learning for NLP 30/06/2003
  • 5. ML4NLP Machine Learning • Learning... but what for? – To perform some particular task – To react to environmental inputs – Concept learning from data: • modelling concepts underlying data • predicting unseen observations • compacting the knowledge representation • knowledge discovery for expert systems • We will concentrate on: – Supervised inductive learning for classification = discriminative learning Machine Learning for NLP 30/06/2003
  • 6. ML4NLP Machine Learning A more precise definition: Obtaining a description of the concept in some representation language that explains observations and helps predicting new instances of the same distribution • What to read? – Machine Learning (Mitchell, 1997) Machine Learning for NLP 30/06/2003
  • 7. ML4NLP Empirical NLP 90’s: Application of Machine Learning techniques (ML) to NLP problems • Lexical and structural ambiguity problems (Classification problems) – Word selection (SR, MT) – Part-of-speech tagging – Semantic ambiguity (polysemy) – Prepositional phrase attachment – Reference ambiguity (anaphora) – etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999) Machine Learning for NLP 30/06/2003
  • 8. ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 9. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity He was shot in the hand as he chased the robbers in the back street [candidate POS tags shown under the ambiguous words: JJ, NN, NN, VB, VB, VB] (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 10. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street [candidate POS tags shown under the ambiguous words: JJ, NN, NN, VB, VB, VB] (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 11. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity He was shot in the hand as he chased the robbers in the back street [candidate senses shown: body-part / clock-part] (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 12. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation He was shot in the hand as he chased the robbers in the back street [candidate senses shown: body-part / clock-part] (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 13. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 14. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 15. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 16. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP Machine Learning for NLP 30/06/2003
  • 17. Classification Feature Vector Classification (AI perspective) • An instance is a vector x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued. • Let X be the space of all possible instances. • Let Y = {y1, …, ym} be the set of categories (or classes). • The goal is to learn an unknown target function f : X → Y. • A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair <x, f(x)>. • Let D be the set of all training examples. Machine Learning for NLP 30/06/2003
  • 18. Classification Feature Vector Classification • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. • The goal is to find a function h belonging to H such that for all pairs <x, f(x)> belonging to D, h(x) = f(x). Machine Learning for NLP 30/06/2003
  • 19. Classification An Example
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
      Rules: (COLOR=red) ∧ (SHAPE=circle) ⇒ positive; otherwise ⇒ negative
      Decision Tree: COLOR = red → SHAPE (circle → positive; triangle → negative); COLOR = blue → negative
      Machine Learning for NLP 30/06/2003
  • 20. Classification An Example
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
      Rules: (SIZE=small) ∧ (SHAPE=circle) ⇒ positive; (SIZE=big) ∧ (COLOR=red) ⇒ positive; otherwise ⇒ negative
      Decision Tree: SIZE = small → SHAPE (circle → pos; triang → neg); SIZE = big → COLOR (red → pos; blue → neg)
      Machine Learning for NLP 30/06/2003
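The two hypotheses of slides 19 and 20 can be checked directly against the four training examples. The sketch below encodes both rule sets (function names are ours) and shows that they agree on the training data yet disagree on an instance that is not in the table.

```python
# Training examples from the slide: (SIZE, COLOR, SHAPE) -> CLASS
train = [
    (("small", "red", "circle"), "positive"),
    (("big", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("big", "blue", "circle"), "negative"),
]

def h_color_first(size, color, shape):
    """Slide 19: (COLOR=red) and (SHAPE=circle) => positive; otherwise negative."""
    return "positive" if color == "red" and shape == "circle" else "negative"

def h_size_first(size, color, shape):
    """Slide 20: (SIZE=small and SHAPE=circle) or (SIZE=big and COLOR=red) => positive."""
    if size == "small" and shape == "circle":
        return "positive"
    if size == "big" and color == "red":
        return "positive"
    return "negative"

consistent = all(h(*x) == y
                 for x, y in train
                 for h in (h_color_first, h_size_first))
unseen = ("big", "red", "triangle")   # not among the training examples
print(consistent, h_color_first(*unseen), h_size_first(*unseen))
```

Both hypotheses are consistent with the training set, so the data alone cannot choose between them; the learner's preference for one over the other is what the next slide calls inductive bias.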
  • 21. Classification Some important concepts • Inductive Bias “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) – Language / Search bias Decision Tree: COLOR = red → SHAPE (circle → positive; triangle → negative); COLOR = blue → negative Machine Learning for NLP 30/06/2003
  • 22. Classification Some important concepts • Inductive Bias • Training error and generalization error • Generalization ability and overfitting • Batch Learning vs. on-line Learning • Symbolic vs. statistical Learning • Propositional vs. first-order learning Machine Learning for NLP 30/06/2003
  • 23. Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) ⇒ classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z) Machine Learning for NLP 30/06/2003
  • 24. Classification The Classification Setting: Class, Point, Example, Data Set, ... (CoLT/SLT perspective) • Input Space: X ⊆ R^n • (binary) Output Space: Y = {+1, −1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m Machine Learning for NLP 30/06/2003
  • 25. Classification The Classification Setting: Learning, Error, ... • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form: $h(\mathbf{x}) = \sum_{i=1}^{n} w_i \phi_i(\mathbf{x}) + b$ • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM) Machine Learning for NLP 30/06/2003
  • 26. Classification The Classification Setting: Learning, Error, ... • Expected error (risk): $R(h) = \int \mathrm{loss}(h(\mathbf{x}), y)\, dP(\mathbf{x}, y)$ • Problem: P itself is unknown. Only training examples are known ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal: $R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{loss}(h(\mathbf{x}_i), y_i)$ Machine Learning for NLP 30/06/2003
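The empirical risk is simply the average loss over the training sample. A minimal sketch, assuming the 0-1 loss (the slide leaves the loss function generic) and using a hypothetical 1-D sample and threshold classifier:

```python
def zero_one_loss(prediction, label):
    """1 for a misclassification, 0 otherwise."""
    return 0 if prediction == label else 1

def empirical_risk(h, sample, loss=zero_one_loss):
    """R_emp(h) = (1/m) * sum_i loss(h(x_i), y_i)."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Hypothetical sample of (x, y) pairs, for illustration only.
sample = [(-2.0, -1), (-0.5, -1), (0.3, -1), (1.0, +1), (2.5, +1)]

def h(x):
    """Hypothetical hypothesis: threshold at zero."""
    return +1 if x >= 0 else -1

risk = empirical_risk(h, sample)
print(risk)  # one of the five examples (x = 0.3) is misclassified
```

ERM would pick, among all hypotheses in H, one that minimises this quantity; moving the threshold to any value between 0.3 and 1.0 brings the empirical risk of this sample to zero.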
  • 27. Classification The Classification Setting: Error, Over(under)fitting, ... • Low training error ⇒ low true error? • The overfitting dilemma: Underfitting / Overfitting • Trade-off between training error and complexity • Different learning biases can be used Machine Learning for NLP 30/06/2003
  • 28. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 29. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms −Decision Trees −AdaBoost −Support Vector Machines • Applications to NLP Machine Learning for NLP 30/06/2003
  • 30. Algorithms Learning Paradigms • Statistical learning: – HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) – Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) – Winnow, AdaBoost, SVM’s, etc. Machine Learning for NLP 30/06/2003
  • 31. Algorithms Learning Paradigms • Classifier combination: – Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples – Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc. Machine Learning for NLP 30/06/2003
  • 32. Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes Machine Learning for NLP 30/06/2003
  • 33. Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc. Machine Learning for NLP 30/06/2003
  • 34. Algorithms An Example A1 v1 v3 v2 A2 A3 ... A2 ... v4 v5 ... Decision Tree A5 A2 ... v6 SIZE small big A5 C3 SHAPE COLOR v7 circle triang red blue C1 C2 C1 pos neg pos neg Machine Learning for NLP 30/06/2003
  • 35. Algorithms Learning Decision Trees Training Training + TDIDT = Set DT Test Example + = Class DT Machine Learning for NLP 30/06/2003
  • 36. Algorithms General Induction Algorithm
      function TDIDT (X: set-of-examples; A: set-of-features)
        var: tree1, tree2: decision-tree;
             X’: set-of-examples;
             A’: set-of-features
        end-var
        if (stopping_criterion (X)) then
          tree1 := create_leaf_tree (X)
        else
          amax := feature_selection (X, A);
          tree1 := create_tree (X, amax);
          for-all val in values (amax) do
            X’ := select_examples (X, amax, val);
            A’ := A - {amax};
            tree2 := TDIDT (X’, A’);
            tree1 := add_branch (tree1, tree2, val)
          end-for
        end-if
        return (tree1)
      end-function
      Machine Learning for NLP 30/06/2003
  • 37. Algorithms General Induction Algorithm (the TDIDT pseudocode of slide 36, repeated) Machine Learning for NLP 30/06/2003
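The TDIDT pseudocode translates almost line by line into Python. The sketch below makes two choices that the pseudocode leaves open, both assumptions on our part: the stopping criterion is "all examples share one class, or no features remain", and feature selection uses information gain. The tree representation (a class label for a leaf, a `(feature, branches)` tuple for an internal node) is also ours.

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    total = len(examples)
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def majority_class(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def tdidt(examples, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) node."""
    labels = {label for _, label in examples}
    # stopping_criterion: pure node, or nothing left to split on
    if len(labels) == 1 or not features:
        return majority_class(examples)
    # feature_selection: maximum information gain (first feature wins ties)
    best = max(features, key=lambda f: information_gain(examples, f))
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[best] == value]
        # A' := A - {amax}
        branches[value] = tdidt(subset, [f for f in features if f != best])
    return (best, branches)

def classify(tree, x):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree

# The four training examples of the earlier SIZE/COLOR/SHAPE table.
examples = [
    ({"SIZE": "small", "COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"SIZE": "big", "COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"SIZE": "small", "COLOR": "red", "SHAPE": "triangle"}, "negative"),
    ({"SIZE": "big", "COLOR": "blue", "SHAPE": "circle"}, "negative"),
]
tree = tdidt(examples, ["SIZE", "COLOR", "SHAPE"])
print(tree)
```

On this toy set COLOR and SHAPE tie at the root and COLOR is taken because it is listed first, which yields the COLOR-then-SHAPE tree of the earlier example slide. `classify` raises `KeyError` for a feature value never seen in training; a fuller implementation would fall back to a majority class.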
  • 38. Algorithms Feature Selection Criteria • Functions derived from Information Theory: – Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures – Gini Diversity Index (Breiman et al. 84) – RLM (López de Mántaras 91) • Statistically-based – Chi-square test (Sestito & Dillon 94) – Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94) Machine Learning for NLP 30/06/2003
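Two of the criteria listed above can be computed in a few lines. The sketch below scores each feature of the four-example SIZE/COLOR/SHAPE table by the reduction in class impurity after the split, using entropy (information gain) and the Gini index as impurity measures; the helper names are ours, and the other criteria on the slide (Gain Ratio, RLM, chi-square, Symmetrical Tau, RELIEFF) are not covered.

```python
import math
from collections import Counter

def class_distribution(labels):
    total = len(labels)
    return [c / total for c in Counter(labels).values()]

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_distribution(labels))

def gini(labels):
    return 1.0 - sum(p * p for p in class_distribution(labels))

def split_score(rows, feature, impurity):
    """Impurity of the class labels minus the weighted impurity after the split."""
    labels = [r["CLASS"] for r in rows]
    after = 0.0
    for value in {r[feature] for r in rows}:
        part = [r["CLASS"] for r in rows if r[feature] == value]
        after += len(part) / len(rows) * impurity(part)
    return impurity(labels) - after

rows = [
    {"SIZE": "small", "COLOR": "red", "SHAPE": "circle", "CLASS": "positive"},
    {"SIZE": "big", "COLOR": "red", "SHAPE": "circle", "CLASS": "positive"},
    {"SIZE": "small", "COLOR": "red", "SHAPE": "triangle", "CLASS": "negative"},
    {"SIZE": "big", "COLOR": "blue", "SHAPE": "circle", "CLASS": "negative"},
]
for feature in ("SIZE", "COLOR", "SHAPE"):
    print(feature,
          round(split_score(rows, feature, entropy), 4),
          round(split_score(rows, feature, gini), 4))
```

Both measures give SIZE a score of zero at the root and rate COLOR and SHAPE equally, so on this data they would lead a greedy learner to the same first split.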
  • 39. Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimize the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc. Machine Learning for NLP 30/06/2003
  • 40. Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98) Machine Learning for NLP 30/06/2003
  • 41. Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97) Machine Learning for NLP 30/06/2003
  • 42. Algorithms Decision Trees: pros&cons • Advantages – Acquires symbolic knowledge in an understandable way – Very well studied ML algorithms and variants – Can be easily translated into rules – Existence of available software: C4.5, C5.0, etc. – Can be easily integrated into an ensemble Machine Learning for NLP 30/06/2003
  • 43. Algorithms Decision Trees: pros&cons • Drawbacks – Computationally expensive when scaling to large natural language domains: training examples, features, etc. – Data sparseness and data fragmentation: the problem of the small disjuncts => Probability estimation – DTs are models with high variance (unstable) – Tendency to overfit training data: pruning is necessary – Requires quite a big effort in tuning the model Machine Learning for NLP 30/06/2003
  • 44. Algorithms Boosting algorithms • Idea “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively • Many other variants and extensions (1997-2003) Machine Learning for NLP 30/06/2003
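  • 45. Algorithms AdaBoost: general scheme [Diagram: training sets TS1, TS2, ..., TST, weighted by probability distributions D1, D2, ..., DT, are each passed to a Weak Learner, producing weak hypotheses h1, h2, ..., hT; the distribution is updated between rounds (“Probability distribution updating”); the final classifier is a linear combination F(h1, h2, ..., hT)] Machine Learning for NLP 30/06/2003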
  • 46. Algorithms AdaBoost: algorithm (Freund & Schapire 97) Machine Learning for NLP 30/06/2003
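The algorithm on this slide survives only as an image, so the sketch below follows the commonly presented discrete AdaBoost for labels in {+1, -1}: reweight examples with alpha = 0.5 * ln((1 - err) / err) and combine the weak hypotheses by a weighted vote. It may differ in detail from the slide's formulation. The weak learner (a 1-D threshold stump), the function names and the data are hypothetical.

```python
import math

def train_stump(xs, ys, weights):
    """Best threshold rule h(x) = s if x > theta else -s under the weights."""
    best = None
    for theta in sorted(set(xs)):
        for s in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (s if x > theta else -s) != y)
            if best is None or err < best[0]:
                best = (err, theta, s)
    return best

def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m           # D_1: uniform distribution
    ensemble = []                      # list of (alpha, theta, s)
    for _ in range(rounds):
        err, theta, s = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, theta, s))
        # Raise the weight of misclassified examples, lower the rest, renormalise
        weights = [w * math.exp(-alpha * y * (s if x > theta else -s))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of the weak hypotheses."""
    score = sum(alpha * (s if x > theta else -s) for alpha, theta, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data that no single threshold separates.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

A single stump misclassifies at least two of the six points, while the three-round combination classifies all of them correctly, which is the behaviour the round-by-round example on the following slides illustrates graphically.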
  • 47. Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes Machine Learning for NLP 30/06/2003
  • 48. Algorithms AdaBoost: round 1 Machine Learning for NLP 30/06/2003
  • 49. Algorithms AdaBoost: round 2 Machine Learning for NLP 30/06/2003
  • 50. Algorithms AdaBoost: round 3 Machine Learning for NLP 30/06/2003
  • 51. Algorithms Combined Hypothesis Machine Learning for NLP 30/06/2003
  • 52. Algorithms AdaBoost and NLP • POS Tagging (Abney et al. 99; Màrquez 99) • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99) • PP-attachment Disambiguation (Abney et al. 99) • Parsing (Haruno et al. 99) • Word Sense Disambiguation (Escudero et al. 00, 01) • Shallow parsing (Carreras & Màrquez, 01a; 02) • Email spam filtering (Carreras & Màrquez, 01b) • Term Extraction (Vivaldi et al. 01)
  • 53. Algorithms AdaBoost: pros&cons + Easy to implement and few parameters to set + Time and space grow linearly with the number of examples. Ability to manage very large learning problems + Does not constrain explicitly the complexity of the learner + Naturally combines feature selection with learning + Has been successfully applied to many practical problems
  • 54. Algorithms AdaBoost: pros&cons ± Seems to be rather robust to overfitting (number of rounds) but sensitive to noise ± Performance is very good when there are relatively few relevant terms (features) – Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers: the training errors of the base classifiers become too large too quickly
  • 55. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000)
  • 56. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Key Concepts
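  • 57. Algorithms Linear Classifiers • Hyperplanes in $\mathbb{R}^N$. • Defined by a weight vector ($w$) and a threshold ($b$). • They induce a classification rule: $h(x) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right)$, that is, $h(x) = +1$ if $\sum_{i=1}^{N} w_i x_i + b \ge 0$ and $h(x) = -1$ otherwise. [Figure: positive (+) and negative (−) examples separated by a hyperplane with normal vector w and offset b.]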
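As a minimal sketch, the classification rule induced by a hyperplane can be written directly. The function name is hypothetical, and assigning points lying exactly on the hyperplane to the positive class is an assumption of this sketch.

```python
def linear_classify(w, b, x):
    """Return +1 if w . x + b >= 0 and -1 otherwise (ties go to +1 by assumption)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

# Hypothetical 2-D hyperplane with weight vector w and threshold b.
w_example = (1.0, -1.0)
b_example = -0.5
positive_side = linear_classify(w_example, b_example, (2.0, 0.0))
negative_side = linear_classify(w_example, b_example, (0.0, 2.0))
```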
  • 58. Algorithms Optimal Hyperplane: Geometric Intuition
  • 59. Algorithms Optimal Hyperplane: Geometric Intuition [Figure labels: the support vectors; the maximal margin hyperplane.]
  • 60. Algorithms Linearly separable data • Geometric margin: $2 / \|w\|_2$ • Maximizing the margin is equivalent to minimizing $\|w\|_2^2$ subject to the constraints $y_i (w \cdot x_i + b) \ge 1$ for all $i = 1, \dots, l$ • This is a Quadratic Programming problem
  • 61. Algorithms Non-separable case (soft margin) • $\xi_1, \dots, \xi_l$: positive slack variables for introducing costs • Minimize $\|w\|_2^2 + C \sum_{i=1}^{l} \xi_i$ subject to the constraints $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ for all $i = 1, \dots, l$ and $\xi_i \ge 0$ for all $i = 1, \dots, l$
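To make the role of the slack variables concrete, the following sketch evaluates the soft-margin objective for a fixed hyperplane. It does not solve the quadratic program: for given w and b, it sets each slack to the smallest value that satisfies both constraints, max(0, 1 - y_i (w . x_i + b)). The function name and the toy values are hypothetical.

```python
def soft_margin_objective(w, b, C, xs, ys):
    """Return (objective, slacks) for ||w||^2 + C * sum of slacks at a fixed (w, b)."""
    slacks = []
    for x, y in zip(xs, ys):
        functional_margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Smallest slack with y (w . x + b) >= 1 - slack and slack >= 0.
        slacks.append(max(0.0, 1.0 - functional_margin))
    objective = sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical data: two points respect the margin, one falls inside it,
# and one is misclassified.
xs_example = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 0.0)]
ys_example = [1, 1, -1, -1]
objective_example, slacks_example = soft_margin_objective(
    (1.0, 0.0), 0.0, 10.0, xs_example, ys_example)
```

A larger C makes each unit of slack more expensive, so the optimiser prefers fitting the training examples over a wide margin; a smaller C tolerates more margin violations.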
  • 62. Algorithms Non-linear SVMs • Implicit mapping into feature space via kernel functions • Non-linear mapping: $\phi : X \to F$ • Set of hypotheses: $f(x) = \sum_{i=1}^{n} w_i \phi_i(x) + b$ • Dual formulation: $f(x) = \sum_{i=1}^{l} \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b$ • Kernel function: $K(x, z) = \phi(x) \cdot \phi(z)$ • Evaluation: $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$
  • 63. Algorithms Non-linear SVMs • Kernel functions – Must be efficiently computable – Characterization via Mercer’s theorem – One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000) – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
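The point that the feature map never has to be computed can be checked on a small case. The sketch below assumes the homogeneous degree-2 polynomial kernel on 2-D inputs, K(x, z) = (x . z)^2, whose explicit feature map is (x1^2, sqrt(2) x1 x2, x2^2), and evaluates a dual-form decision function using kernel calls only. Function names and the numeric values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D inputs such that phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def svm_decision(alphas, ys, support_vectors, b, kernel, x):
    """Dual-form evaluation: f(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, ys, support_vectors)) + b

x_example = (1.0, 2.0)
z_example = (3.0, -1.0)
via_kernel = poly2_kernel(x_example, z_example)
via_feature_map = dot(poly2_feature_map(x_example), poly2_feature_map(z_example))
```

Both routes give the same inner product, but the kernel needs only one dot product in the input space, which is what keeps learning in very high dimensional feature spaces tractable.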
  • 64. Algorithms Non-linear SVMs: degree 3 polynomial kernel [Figure: two panels, linearly separable data and linearly non-separable data.]
  • 65. Algorithms Toy Examples • All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University) “LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…” • Available from: (it includes a Web integrated demo tool)
  • 66. Algorithms Toy Examples (I) • Linearly separable data set • Linear SVM • Maximal margin hyperplane • What happens if we add a blue training example here?
  • 67. Algorithms Toy Examples (I) • (still) Linearly separable data set • Linear SVM • High value of C parameter • Maximal margin hyperplane • The example is correctly classified
  • 68. Algorithms Toy Examples (I) • (still) Linearly separable data set • Linear SVM • Low value of C parameter • Trade-off between margin and training error • The example is now a bounded SV
  • 69. Algorithms Toy Examples (II)
  • 70. Algorithms Toy Examples (II)
  • 71. Algorithms Toy Examples (II)
  • 72. Algorithms Toy Examples (III)
  • 73. Algorithms SVM: Summary • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+) • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+) • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+)
  • 74. Algorithms SVM: Summary • Due to Mercer’s conditions on the kernels, the optimisation problems are convex. No local minima (+) • Optimisation theory guides the implementation. Efficient learning (+) • Mainly for classification but also for regression, density estimation, clustering, etc. • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+) • Parameter tuning (–). Implications for convergence times, sparsity of the solution, etc.
  • 75. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP
  • 76. Applications NLP problems • Warning! We will not focus on final NLP applications, but on intermediate tasks... • We will classify the NLP tasks according to their (structural) complexity
  • 77. Applications NLP problems: structural complexity • Decisional problems − Text Categorization, Document filtering, Word Sense Disambiguation, etc. • Sequence tagging and detection of sequential structures − POS tagging, Named Entity extraction, syntactic chunking, etc. • Hierarchical structures − Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.
  • 78. Applications POS tagging • Morpho-syntactic ambiguity: Part of Speech Tagging • Example: “He was shot in the hand as he chased the robbers in the back street” (The Wall Street Journal Corpus) [Figure: alternative tags (JJ, NN, NN, VB, VB, VB) shown for the ambiguous words of the sentence.]
  • 79. Applications POS tagging “preposition-adverb” tree [Figure: path through the decision tree. Root: P(IN)=0.81, P(RB)=0.19. Branch Word Form = “As”/“as” (vs. others): P(IN)=0.83, P(RB)=0.17. Branch tag(+1) = RB (vs. others): P(IN)=0.13, P(RB)=0.87. Branch tag(+2) = IN, leaf: P(IN)=0.013, P(RB)=0.987.] Probabilistic interpretation: $\hat{P}(\mathrm{RB} \mid \text{word}=\text{“A/as”} \wedge \text{tag}(+1)=\mathrm{RB} \wedge \text{tag}(+2)=\mathrm{IN}) = 0.987$ and $\hat{P}(\mathrm{IN} \mid \text{word}=\text{“A/as”} \wedge \text{tag}(+1)=\mathrm{RB} \wedge \text{tag}(+2)=\mathrm{IN}) = 0.013$
  • 80. Applications POS tagging “preposition-adverb” tree [Figure: same path through the decision tree. Root: P(IN)=0.81, P(RB)=0.19. Branch Word Form = “As”/“as” (vs. others): P(IN)=0.83, P(RB)=0.17. Branch tag(+1) = RB (vs. others): P(IN)=0.13, P(RB)=0.87. Branch tag(+2) = IN, leaf: P(IN)=0.013, P(RB)=0.987.] Collocations: “as_RB much_RB as_IN”, “as_RB soon_RB as_IN”, “as_RB well_RB as_IN”
  • 81. Applications POS tagging RTT (Màrquez & Rodríguez 97) • A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01) [Figure: raw text → morphological analysis → disambiguation loop using the language model (classify, update, filter; repeat until the stop test answers yes) → tagged text.]
  • 82. Applications POS tagging STT (Màrquez & Rodríguez 97) • The Use of Classifiers in sequential inference: Chunking (Punyakanok & Roth, 00) [Figure: raw text → morphological analysis → disambiguation with the Viterbi algorithm, using a language model made of lexical probs. + contextual probs. → tagged text.]
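The disambiguation step that combines lexical and contextual probabilities with the Viterbi algorithm can be sketched as follows. This is a generic bigram formulation, not the STT implementation: it assumes lexical probabilities P(word | tag) and contextual probabilities P(tag | previous tag) stored in plain dictionaries, and a small floor for unseen events. All names and probability values are hypothetical.

```python
import math

def viterbi(words, tags, lexical, transition, initial):
    """Most probable tag sequence under lexical and bigram contextual probabilities."""
    floor = 1e-12  # assumed floor for unseen events, keeps log() defined

    def lp(p):
        return math.log(max(p, floor))

    # scores[t]: best log-probability of any tag path for the words so far ending in t.
    scores = {t: lp(initial.get(t, 0.0)) + lp(lexical.get((words[0], t), 0.0))
              for t in tags}
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: scores[p] + lp(transition.get((p, t), 0.0)))
            new_scores[t] = (scores[prev]
                             + lp(transition.get((prev, t), 0.0))
                             + lp(lexical.get((word, t), 0.0)))
            pointers[t] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: scores[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Hypothetical probabilities for the phrase "the back street".
tagset = ["DT", "JJ", "NN"]
lexical_probs = {("the", "DT"): 0.9, ("back", "JJ"): 0.3,
                 ("back", "NN"): 0.4, ("street", "NN"): 0.8}
contextual_probs = {("DT", "JJ"): 0.4, ("DT", "NN"): 0.5,
                    ("JJ", "NN"): 0.7, ("NN", "NN"): 0.1}
initial_probs = {"DT": 0.8, "JJ": 0.1, "NN": 0.1}
best_tags = viterbi(["the", "back", "street"], tagset,
                    lexical_probs, contextual_probs, initial_probs)
```

With these numbers “back” is lexically more likely to be a noun, but the contextual probabilities make the adjective reading win once the following noun is taken into account, which is the kind of interaction the sequential inference step is meant to capture.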
  • 83. Applications Detection of sequential and hierarchical structures • Named Entity recognition • Clause detection
  • 84. Conclusions Summary/conclusions • We have briefly outlined: − The ML setting: “supervised learning for classification” − Three concrete machine learning algorithms − How to apply them to solve intermediate NLP tasks
  • 85. Conclusions Summary/conclusions • Any ML algorithm for NLP should be: – Robust to noise and outliers – Efficient in large feature/example spaces – Adaptive to new/changing domains: portability, tuning, etc. – Able to take advantage of unlabelled examples: semi-supervised learning
  • 86. Conclusions Summary/conclusions • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research
  • 87. Conclusions Some current research lines • Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02). • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc. • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction • etc.
  • 88. Conclusions Bibliography • You may find additional information at: tesi.html publicacions/pubs.html cursos/talks.html cursos/MLandNL.html cursos/emnlp1.html • This talk at:
  • 89. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003