Seminar: Statistical NLP


   Machine Learning for
Natural Language Processing
              Lluís Màrquez
            TALP Research Center
      Llenguatges i Sistemes Informàtics
      Universitat Politècnica de Catalunya

              Girona, June 2003


Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP




ML4NLP
               Machine Learning
   • There are many general-purpose definitions of Machine
     Learning (or artificial learning):


         Making a computer automatically acquire some
         kind of knowledge from a concrete data domain


   • Learners are computers: we study learning algorithms
   • Resources are scarce: time, memory, data, etc.
   • It has (almost) nothing to do with: Cognitive science,
     neuroscience, theory of scientific discovery and research, etc.
   • Biological plausibility is welcome but not the main goal



ML4NLP
                Machine Learning
   • Learning... but what for?
         – To perform some particular task
         – To react to environmental inputs
         – Concept learning from data:
            • modelling concepts underlying data
            • predicting unseen observations
            • compacting the knowledge representation
            • knowledge discovery for expert systems

   • We will concentrate on:
         – Supervised inductive learning for classification
           = discriminative learning


ML4NLP
             Machine Learning

     A more precise definition:

        Obtaining a description of the concept in
      some representation language that explains
       the observations and helps predict new
          instances of the same distribution


     • What to read?
         – Machine Learning (Mitchell, 1997)


ML4NLP
                       Empirical NLP
     90’s: Application of Machine Learning (ML)
           techniques to NLP problems

     • Lexical and structural ambiguity problems
       (classification problems):
          –   Word selection (SR, MT)
          –   Part-of-speech tagging
          –   Semantic ambiguity (polysemy)
          –   Prepositional phrase attachment
          –   Reference ambiguity (anaphora)
          –   etc.

     • What to read? Foundations of Statistical Natural
       Language Processing (Manning & Schütze, 1999)
ML4NLP

     NLP “classification” problems
     • Ambiguity is a crucial problem for natural
       language understanding/processing.
       Ambiguity Resolution = Classification


         He was shot in the hand as he chased
            the robbers in the back street

                               (The Wall Street Journal Corpus)




ML4NLP

     NLP “classification” problems

      • Morpho-syntactic ambiguity:
        Part of Speech Tagging

          He was shot in the hand as he chased
             the robbers in the back street

          [in the original slide, NN/VB/JJ alternatives are
           marked under the ambiguous tokens]

                                 (The Wall Street Journal Corpus)
ML4NLP

     NLP “classification” problems

      • Semantic (lexical) ambiguity:
        Word Sense Disambiguation

          He was shot in the hand as he chased
             the robbers in the back street

          (“hand” must be disambiguated between its
           body-part and clock-part senses)

                                 (The Wall Street Journal Corpus)
ML4NLP

     NLP “classification” problems

     • Structural (syntactic) ambiguity




         He was shot in the hand as he chased
            the robbers in the back street

                               (The Wall Street Journal Corpus)




ML4NLP

     NLP “classification” problems

     • Structural (syntactic) ambiguity:
       PP-attachment disambiguation


         He was shot in the hand as he (chased
         (the robbers)NP (in the back street)PP)

                               (The Wall Street Journal Corpus)




Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP




Classification

         Feature Vector Classification
                                                                       AI
                                                                   perspective
       • An instance is a vector x = <x1, …, xn> whose components,
         called features (or attributes), are discrete- or real-valued.

       • Let X be the space of all possible instances.

       • Let Y = {y1, …, ym} be the set of categories (or classes).

       • The goal is to learn an unknown target function f : X → Y.

       • A training example is an instance x ∈ X labelled with the
         correct value of f(x), i.e., a pair <x, f(x)>.

       • Let D be the set of all training examples.
Classification

         Feature Vector Classification

     • The hypothesis space, H, is the set of functions h : X → Y
       that the learner can consider as possible definitions.

           The goal is to find a function h ∈ H such that,
           for every pair <x, f(x)> ∈ D,  h(x) = f(x).
Classification
                           An Example
                 Example | SIZE  | COLOR | SHAPE    | CLASS
                 --------+-------+-------+----------+----------
                    1    | small | red   | circle   | positive
                    2    | big   | red   | circle   | positive
                    3    | small | red   | triangle | negative
                    4    | big   | blue  | circle   | negative

              Rules:
                (COLOR=red) ∧ (SHAPE=circle) → positive
                otherwise                    → negative

              Decision Tree:
                COLOR?
                  red  → SHAPE?
                           circle   → positive
                           triangle → negative
                  blue → negative
Classification
                           An Example
                 (same training set as above)

              Rules:
                (SIZE=small) ∧ (SHAPE=circle) → positive
                (SIZE=big)   ∧ (COLOR=red)    → positive
                otherwise                     → negative

              Decision Tree:
                SIZE?
                  small → SHAPE?
                            circle   → positive
                            triangle → negative
                  big   → COLOR?
                            red  → positive
                            blue → negative
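Both rule sets above can be checked mechanically against the table; a minimal sketch in Python (the encoding is ours, not from the slides):

    EXAMPLES = [
        # ((size, color, shape), class): the four training examples above
        (("small", "red",  "circle"),   "positive"),
        (("big",   "red",  "circle"),   "positive"),
        (("small", "red",  "triangle"), "negative"),
        (("big",   "blue", "circle"),   "negative"),
    ]

    def rules_1(size, color, shape):
        """(COLOR=red) AND (SHAPE=circle) -> positive; otherwise negative."""
        return "positive" if color == "red" and shape == "circle" else "negative"

    def rules_2(size, color, shape):
        """(SIZE=small AND SHAPE=circle) or (SIZE=big AND COLOR=red) -> positive."""
        if (size == "small" and shape == "circle") or (size == "big" and color == "red"):
            return "positive"
        return "negative"

    # Both hypotheses are consistent with all of D; the choice between
    # them is exactly what the inductive bias (next slide) must resolve.
    for x, y in EXAMPLES:
        assert rules_1(*x) == y == rules_2(*x)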
Classification

         Some important concepts
     • Inductive Bias
         “Any means that a classification learning system uses to choose
         between two functions that are both consistent with the training
         data is called inductive bias” (Mooney & Cardie, 99)
          – Language bias / search bias

                                 Decision Tree:
                                   COLOR?
                                     red  → SHAPE?
                                              circle   → positive
                                              triangle → negative
                                     blue → negative
Classification

         Some important concepts
     • Inductive Bias

     • Training error and generalization error

     • Generalization ability and overfitting

      • Batch learning vs. on-line learning

     • Symbolic vs. statistical Learning

     • Propositional vs. first-order learning



Classification

                     Propositional vs.
                    Relational Learning
     • Propositional learning

                  color(red) ∧ shape(circle) → classA

     • Relational learning = ILP (induction of logic programs)

         course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)

         research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧
         link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)
Classification
            The Classification Setting
             Class, Point, Example, Data Set, ...
                                                            CoLT/SLT
                                                           perspective
          • Input Space: X ⊆ Rⁿ
          • (binary) Output Space: Y = {+1, −1}
          • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn)
          • Example: (x, y) with x ∈ X, y ∈ Y
          • Training Set: a set of m examples generated i.i.d.
            according to an unknown distribution P(x,y):
                  S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)ᵐ
Classification
            The Classification Setting
                     Learning, Error, ...
          • The hypothesis space, H, is the set of functions
            h : X → Y that the learner can consider as possible
            definitions. In SVMs they are of the form:

                  h(x) = ∑_{i=1}^{n} w_i φ_i(x) + b

          • The goal is to find a function h ∈ H such that the
            expected misclassification error on new examples,
            also drawn from P(x,y), is minimal
            (Risk Minimization, RM)
Classification
            The Classification Setting
                     Learning, Error, ...
         • Expected error (risk):

                   R(h) = ∫ loss(h(x), y) dP(x, y)

         • Problem: P itself is unknown. Only the training
           examples are known ⇒ an induction principle is needed
         • Empirical Risk Minimization (ERM): find the function
           h ∈ H for which the training error (empirical risk)
           is minimal:

                   R_emp(h) = (1/m) ∑_{i=1}^{m} loss(h(x_i), y_i)
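For concreteness, a tiny sketch of R_emp in Python; the slide leaves the loss generic, so the 0/1 loss here is our assumption:

    import numpy as np

    def empirical_risk(h, X, y):
        """R_emp(h) = (1/m) * sum_i loss(h(x_i), y_i), here with 0/1 loss."""
        return np.mean(np.array([h(x) for x in X]) != y)

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-1, -1, +1, +1])
    h = lambda x: +1 if x[0] > 1.5 else -1    # a simple threshold hypothesis
    print(empirical_risk(h, X, y))            # 0.0: h is consistent with this sample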
Classification
            The Classification Setting
                 Error, Over(under)fitting,...
         • Low training error ⇒ low true error?
         • The overfitting dilemma:

           [figure: fits of increasing complexity, from
            underfitting to overfitting]

         • Trade-off between training error and complexity
         • Different learning biases can be used
Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
  −Decision Trees
  −AdaBoost
  −Support Vector Machines

• Applications to NLP
Algorithms
                 Learning Paradigms

       • Statistical learning:
             – HMM, Bayesian Networks, ME, CRF, etc.
       • Traditional methods from Artificial Intelligence
         (ML, AI)
             – Decision trees/lists, exemplar-based learning, rule
               induction, neural networks, etc.

       • Methods from Computational Learning
         Theory (CoLT/SLT)
             – Winnow, AdaBoost, SVM’s, etc.


Algorithms
              Learning Paradigms

       • Classifier combination:
             – Bagging, Boosting, Randomization, ECOC,
               Stacking, etc.

       • Semi-supervised learning: learning from
         labelled and unlabelled examples
             – Bootstrapping, EM, Transductive learning
               (SVM’s, AdaBoost), Co-Training, etc.

       • etc.


Algorithms
                  Decision Trees
  • Decision trees are a way to represent rules underlying
    training data, with hierarchical structures that
    recursively partition the data.

  • They have been used by many research communities
    (Pattern Recognition, Statistics, ML, etc.) for data
    exploration with some of the following purposes:
    Description, Classification, and Generalization.

   • From a machine-learning perspective: Decision Trees are
     n-ary branching trees that represent classification rules
     for classifying the objects of a certain domain into a set
     of mutually exclusive classes
Algorithms
                    Decision Trees
     • Acquisition:
       Top-Down Induction of Decision Trees
       (TDIDT)
     • Systems:
             CART (Breiman et al. 84),
             ID3, C4.5, C5.0 (Quinlan 86,93,98),
             ASSISTANT, ASSISTANT-R (Cestnik et al. 87)
             (Kononenko et al. 95)
             etc.



Algorithms
                                An Example
            Decision Tree:
              SIZE?
                small → SHAPE?
                          circle   → positive
                          triangle → negative
                big   → COLOR?
                          red  → positive
                          blue → negative

            [figure: the same idea drawn as a generic n-ary tree,
             with attribute tests A1 … A5, branch values v1 … v7,
             and leaf classes C1 … C3]
Algorithms
         Learning Decision Trees
              Training:   Training Set  +  TDIDT  →  DT

              Test:       Example  +  DT  →  Class
Algorithms




         General Induction Algorithm
          function TDIDT (X: set-of-examples; A: set-of-features)
            var: tree1, tree2: decision-tree;
                 X’: set-of-examples;
                 A’: set-of-features
            end-var
            if (stopping_criterion (X)) then
               tree1 := create_leaf_tree (X)
            else
               amax  := feature_selection (X, A);
               tree1 := create_tree (X, amax);
               for-all val in values (amax) do
                 X’    := select_examples (X, amax, val);
                 A’    := A - {amax};
                 tree2 := TDIDT (X’, A’);
                 tree1 := add_branch (tree1, tree2, val)
               end-for
            end-if
            return (tree1)
          end-function
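A runnable Python rendering of the TDIDT loop above. The stopping criterion (pure node or no features left), majority-class leaves, and information-gain feature selection are our choices; the pseudocode leaves them abstract:

    from collections import Counter
    import math

    def entropy(labels):
        m = len(labels)
        return -sum(c/m * math.log2(c/m) for c in Counter(labels).values())

    def tdidt(X, y, features):
        """X: list of dicts mapping feature -> value; y: list of class labels."""
        if len(set(y)) == 1 or not features:            # stopping_criterion
            return Counter(y).most_common(1)[0][0]      # create_leaf_tree: majority class
        def info_gain(a):                               # feature_selection
            remainder = 0.0
            for v in set(x[a] for x in X):
                sub = [yi for x, yi in zip(X, y) if x[a] == v]
                remainder += len(sub) / len(y) * entropy(sub)
            return entropy(y) - remainder
        amax = max(features, key=info_gain)
        tree = {"feature": amax, "branches": {}}        # create_tree
        for val in set(x[amax] for x in X):             # for-all val in values(amax)
            Xv = [x for x in X if x[amax] == val]       # select_examples
            yv = [yi for x, yi in zip(X, y) if x[amax] == val]
            tree["branches"][val] = tdidt(Xv, yv, features - {amax})   # add_branch
        return tree

    # On the toy data from the "An Example" slides:
    X = [{"SIZE": s, "COLOR": c, "SHAPE": sh} for s, c, sh in
         [("small", "red", "circle"), ("big", "red", "circle"),
          ("small", "red", "triangle"), ("big", "blue", "circle")]]
    y = ["positive", "positive", "negative", "negative"]
    print(tdidt(X, y, {"SIZE", "COLOR", "SHAPE"}))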
Algorithms

             Feature Selection Criteria
     • Functions derived from Information Theory:
        – Information Gain, Gain Ratio (Quinlan 86)

     • Functions derived from Distance Measures
        – Gini Diversity Index (Breiman et al. 84)
        – RLM (López de Mántaras 91)

     • Statistically-based
        – Chi-square test (Sestito & Dillon 94)
        – Symmetrical Tau (Zhou & Dillon 91)

     • RELIEFF-IG: variant of RELIEFF (Kononenko 94)

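As an illustration of the distance-based criteria above, the Gini diversity index fits in a few lines (a sketch, not the CART implementation):

    from collections import Counter

    def gini(labels):
        """Gini diversity index: 1 - sum_k p_k^2 (Breiman et al. 84)."""
        m = len(labels)
        return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

    print(gini(["pos", "pos", "neg", "neg"]))   # 0.5: maximally impure for two classes
    print(gini(["pos", "pos", "pos", "pos"]))   # 0.0: pure node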
Algorithms
                Extensions of DTs
                                      (Murthy 95)


      • Pruning (pre/post)
      • Minimize the effect of the greedy approach:
        lookahead
       • Non-linear splits
      • Combination of multiple models
      • Incremental learning (on-line)
      • etc.

Algorithms
             Decision Trees and NLP
    • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
    • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez &
      Rodríguez 95,97; Màrquez et al. 00)

    • Word sense disambiguation (Brown et al. 91; Cardie 93;
      Mooney 96)

    • Parsing (Magerman 95,96; Haruno et al. 98,99)
    • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
    • Text summarization (Mani & Bloedorn 98)
    • Dialogue act tagging (Samuel et al. 98)


Algorithms
             Decision Trees and NLP
     • Noun phrase coreference
         (Aone & Bennett 95; McCarthy & Lehnert 95)

     • Discourse analysis in information extraction
        (Soderland & Lehnert 94)

     • Cue phrase identification in text and speech
        (Litman 94; Siegel & McKeown 94)

     • Verb classification in Machine Translation
        (Tanaka 96; Siegel 97)




Algorithms

        Decision Trees: pros&cons

    • Advantages
        – Acquires symbolic knowledge in an
          understandable way
       – Very well studied ML algorithms and variants
       – Can be easily translated into rules
       – Existence of available software: C4.5, C5.0, etc.
       – Can be easily integrated into an ensemble




Algorithms


        Decision Trees: pros&cons
  • Drawbacks
     – Computationally expensive when scaling to large
       natural language domains: training examples,
       features, etc.
      – Data sparseness and data fragmentation: the problem
        of the small disjuncts ⇒ probability estimation
      – DTs are a model with high variance (unstable)
      – Tendency to overfit the training data: pruning is necessary
      – Requires quite a big effort in tuning the model
Algorithms
             Boosting algorithms
   • Idea
     “to combine many simple and moderately accurate
      hypotheses (weak classifiers) into a single and highly
      accurate classifier”

   • AdaBoost (Freund & Schapire 95) has been
     theoretically and empirically studied extensively

    • Many other variants and extensions (1997-2003)
      http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html




Algorithms
             AdaBoost: general scheme

              [figure: T weak learners h1 … hT are trained in
               sequence, each on the training set TS1 … TST weighted
               by a probability distribution D1 … DT; the distribution
               is updated after every round, and the final classifier
               is a linear combination F(h1, h2, …, hT)]
Algorithms
             AdaBoost: algorithm




                            (Freund & Schapire 97)

              [algorithm figure not reproduced here; a sketch follows]
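A compact sketch of binary AdaBoost as described in (Freund & Schapire 97); weak_learn stands in for any weak learner, e.g. the axis-parallel hyperplanes of the next slide:

    import numpy as np

    def adaboost(X, y, T, weak_learn):
        """y: numpy array in {-1,+1}; weak_learn(X, y, D) returns h(x) -> {-1,+1}."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D1: uniform distribution
        hs, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                  # train on current distribution
            pred = np.array([h(x) for x in X])
            eps = float(D[pred != y].sum())          # weighted training error
            if eps == 0.0 or eps >= 0.5:             # perfect, or no better than chance
                if eps == 0.0:
                    hs, alphas = [h], [1.0]
                break
            alpha = 0.5 * np.log((1.0 - eps) / eps)  # confidence of this round
            D *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
            D /= D.sum()                             # renormalize to a distribution
            hs.append(h)
            alphas.append(alpha)
        # Final hypothesis: sign of the linear combination F(h1, ..., hT)
        return lambda x: int(np.sign(sum(a * h(x) for a, h in zip(alphas, hs))))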
Algorithms
             AdaBoost: example




    Weak hypotheses = vertical/horizontal hyperplanes

Algorithms
              AdaBoost: rounds 1–3

              [figures: the reweighted examples and the chosen weak
               hypothesis after each of the first three rounds]
Algorithms
             Combined Hypothesis




        www.research.att.com/~yoav/adaboost




Algorithms
                AdaBoost and NLP
      • POS Tagging (Abney et al. 99; Màrquez 99)
      • Text and Speech Categorization
         (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)

      • PP-attachment Disambiguation (Abney et al. 99)
      • Parsing (Haruno et al. 99)
      • Word Sense Disambiguation (Escudero et al. 00, 01)
      • Shallow parsing (Carreras & Màrquez, 01a; 02)
      • Email spam filtering (Carreras & Màrquez, 01b)
      • Term Extraction (Vivaldi, et al. 01)

Algorithms

             AdaBoost: pros&cons
       + Easy to implement and few parameters to set
       + Time and space grow linearly with number of
         examples. Ability to manage very large learning
         problems
       + Does not constrain explicitly the complexity of the
         learner
       + Naturally combines feature selection with learning
        + Has been successfully applied to many practical
          problems
Algorithms

             AdaBoost: pros&cons

       ± Seems to be rather robust to overfitting
         (number of rounds) but sensitive to noise

       ± Performance is very good when there are
         relatively few relevant terms (features)

        – Can perform poorly when there is insufficient
          training data relative to the complexity of the
          base classifiers: the training errors of the base
          classifiers become too large too quickly
Algorithms



             SVM: A General Definition
      • “Support Vector Machines (SVM) are learning
        systems that use a hypothesis space of linear
        functions in a high dimensional feature space,
        trained with a learning algorithm from optimisation
        theory that implements a learning bias derived
        from statistical learning theory”.
        (Cristianini & Shawe-Taylor, 2000)




Algorithms
                Linear Classifiers
     • Hyperplanes in Rᴺ
     • Defined by a weight vector (w) and a threshold (b)
     • They induce the classification rule:

           h(x) = sign( ∑_{i=1}^{N} w_i x_i + b )
                  i.e.  +1 if ∑_{i=1}^{N} w_i x_i + b ≥ 0,  −1 otherwise

       [figure: positive and negative points separated by a
        hyperplane with normal vector w and offset b]
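The rule in a few lines of numpy (the values of w and b are illustrative):

    import numpy as np

    def h(x, w, b):
        """h(x) = sign(w . x + b), with the common convention sign(0) = +1."""
        return +1 if float(np.dot(w, x)) + b >= 0 else -1

    w, b = np.array([1.0, -2.0]), 0.5
    print(h(np.array([3.0, 1.0]), w, b))   # w.x + b = 1.5  ->  +1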
Algorithms
              Optimal Hyperplane:
              Geometric Intuition
              [figure: a maximal margin hyperplane separating two
               classes; the closest examples, which determine the
               margin, are the support vectors]
Algorithms
             Linearly separable data




        geometric margin = 2 / ‖w‖

        maximizing the margin is equivalent to minimizing ‖w‖²
        (a Quadratic Programming problem) subject to the constraints:

              y_i (w · x_i + b) ≥ 1    for all i = 1, …, l
Algorithms
       Non-separable case (soft margin)




              ξ_1, …, ξ_l ≥ 0 : positive slack variables for
              introducing costs

              Minimize  ‖w‖² + C ∑_{i=1}^{l} ξ_i   subject to the
              constraints:

                    y_i (w · x_i + b) ≥ 1 − ξ_i    for all i = 1, …, l
                    ξ_i ≥ 0                        for all i = 1, …, l
Algorithms
                      Non-linear SVMs
     • Implicit mapping into feature space via kernel functions

           φ : X → F                                   non-linear mapping

           f(x) = ∑_{i=1}^{n} w_i φ_i(x) + b           set of hypotheses

           f(x) = ∑_{i=1}^{l} α_i y_i ⟨φ(x_i), φ(x)⟩ + b   dual formulation

           K(x, z) = ⟨φ(x), φ(z)⟩                      kernel function

           f(x) = ∑_{i=1}^{l} α_i y_i K(x_i, x) + b    evaluation
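The dual evaluation only ever touches the kernel, never φ itself; a sketch with a degree-3 polynomial kernel (the support vectors, α's and b would come from a trained model):

    import numpy as np

    def poly_kernel(x, z, degree=3):
        """K(x, z) = (x . z + 1)^degree: an implicit polynomial feature map."""
        return (np.dot(x, z) + 1.0) ** degree

    def svm_decision(sv, alpha, y_sv, b, x, K=poly_kernel):
        """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, over support vectors only."""
        return np.sign(sum(a * yi * K(xi, x) for xi, a, yi in zip(sv, alpha, y_sv)) + b)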
Algorithms
                   Non-linear SVMs
    • Kernel functions
       – Must be efficiently computable

       – Characterization via Mercer’s theorem

       – One of the curious facts about using a kernel is
             that we do not need to know the underlying
             feature map in order to be able to learn in the
             feature space! (Cristianini & Shawe-Taylor, 2000)
       – Examples: polynomials, Gaussian radial basis
         functions, two-layer sigmoidal neural networks,
         etc.

Algorithms

                Non linear SVMs
                   Degree 3 polynomial kernel

              [figures: the resulting decision boundaries on a
               linearly separable and a linearly non-separable
               data set]
Algorithms

                      Toy Examples
       • All examples have been run with the 2D graphical
         interface of LIBSVM (Chang and Lin, National Taiwan
         University)
          “LIBSVM is an integrated software for support vector classification
          (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution
          estimation (one-class SVM). It supports multi-class classification. The
          basic algorithm is a simplification of both SMO by Platt and SVMLight
          by Joachims. It is also a simplification of the modification 2 of SMO by
          Keerthi et al. Our goal is to help users from other fields to easily use
          SVM as a tool. LIBSVM provides a simple interface where users can
          easily link it with their own programs…”

       • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm
         (it includes a Web-integrated demo tool)
Algorithms
             Toy Examples (I)


                                     Linearly separable data set
                                     Linear SVM
                                     Maximal margin hyperplane

                       What happens if we add a blue training
                       example here (at the point marked in
                       the figure)?
Algorithms
             Toy Examples (I)


                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    High value of C parameter
                                    Maximal margin Hyperplane




                     The example is
                    correctly classified


Algorithms
             Toy Examples (I)


                                    (still) Linearly separable
                                    data set
                                    Linear SVM
                                    Low value of C parameter
                                     Trade-off between margin
                                     and training error


                     The example is
                    now a bounded SV


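The effect of C can be reproduced with any LIBSVM front end; a sketch with scikit-learn's SVC, which wraps LIBSVM (the data points are made up to mimic the figures):

    import numpy as np
    from sklearn.svm import SVC

    # Two separable clusters, plus one blue (-1) point close to the other class.
    X = np.array([[0, 0], [0, 1], [1, 0], [2.6, 2.6],
                  [3, 3], [3, 4], [4, 3]], dtype=float)
    y = np.array([-1, -1, -1, -1, +1, +1, +1])

    for C in (1000.0, 0.05):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # High C: narrow margin, the awkward point is classified correctly.
        # Low C: wider margin; the point may be traded off as a bounded SV.
        print(C, clf.predict([[2.6, 2.6]]), "support vectors:", len(clf.support_))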
Algorithms
              Toy Examples (II)

              [figure-only slides; not reproduced here]
Algorithms
              Toy Examples (III)

              [figure-only slide; not reproduced here]
Algorithms

                SVM: Summary
      • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik,
        1992). Great development since then

     • Kernel-induced feature spaces: SVMs work efficiently
       in very high dimensional feature spaces (+)

     • Learning bias: maximal margin optimisation.
       Reduces the danger of overfitting. Generalization
       bounds for SVMs (+)

     • Compact representation of the induced hypothesis.
       The solution is sparse in terms of SVs (+)


Algorithms

                 SVM: Summary
      • Due to Mercer’s conditions on the kernels, the
        optimisation problems are convex. No local minima (+)
     • Optimisation theory guides the implementation.
       Efficient learning (+)
     • Mainly for classification but also for regression,
       density estimation, clustering, etc.
     • Success in many real-world applications: OCR, vision,
       bioinformatics, speech recognition, NLP: TextCat, POS
       tagging, chunking, parsing, etc. (+)
     • Parameter tuning (–). Implications in convergence
       times, sparsity of the solution, etc.

Outline
• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP




Applications

                NLP problems

        • Warning! We will not focus on
          final NLP applications, but on
          intermediate tasks...

        • We will classify the NLP tasks
          according to their (structural)
          complexity



Applications
          NLP problems: structural
                complexity
       • Decisional problems
          − Text Categorization, Document filtering, Word
            Sense Disambiguation, etc.
       • Sequence tagging and detection of
         sequential structures
          − POS tagging, Named Entity extraction,
            syntactic chunking, etc.
       • Hierarchical structures
          − Clause detection, full parsing, IE of complex
            concepts, composite Named Entities, etc.
Applications

                  POS tagging
       • Morpho-syntactic ambiguity:
         Part of Speech Tagging


          He was shot in the hand as he chased
             the robbers in the back street

          [in the original slide, NN/VB/JJ alternatives are
           marked under the ambiguous tokens]

                                  (The Wall Street Journal Corpus)
Applications

                           POS tagging
                             “preposition-adverb” tree

        [decision tree figure: the root (P(IN)=0.81, P(RB)=0.19)
         tests the Word Form; the branch for “As”/“as”
         (P(IN)=0.83, P(RB)=0.17) tests tag(+1); the branch for
         tag(+1)=RB (P(IN)=0.13, P(RB)=0.87) tests tag(+2); the
         leaf for tag(+2)=IN has P(IN)=0.013, P(RB)=0.987]

 Probabilistic interpretation:

 P̂( RB | word=“As/as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.987
 P̂( IN | word=“As/as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.013
Applications

                      POS tagging
                       “preposition-adverb” tree

        [same decision tree figure as above]

        Collocations:
          “as_RB much_RB as_IN”
          “as_RB soon_RB as_IN”
          “as_RB well_RB as_IN”
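The branch of the tree shown above, hand-encoded and queried in Python (only the displayed path; the "others" branches are omitted, and the encoding is ours):

    # Leaf nodes are class-probability tables; internal nodes are (feature, branches).
    TREE = ("word form", {
        "as": ("tag(+1)", {
            "RB": ("tag(+2)", {
                "IN": {"IN": 0.013, "RB": 0.987},
            }),
        }),
    })

    def classify(node, feats):
        if isinstance(node, dict):          # reached a leaf
            return node
        feature, branches = node
        return classify(branches[feats[feature]], feats)

    print(classify(TREE, {"word form": "as", "tag(+1)": "RB", "tag(+2)": "IN"}))
    # {'IN': 0.013, 'RB': 0.987} -> tag this "as" as RB, as in "as_RB much_RB as_IN"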
Applications
                         POS tagging
                                        RTT (Màrquez & Rodríguez 97)

        [pipeline figure: raw text → morphological analysis →
         disambiguation loop (classify → update → filter, repeated
         until a stop condition holds, guided by a language model)
         → tagged text]

        A Sequential Model for Multi-class Classification:
        NLP/POS Tagging (Even-Zohar & Roth, 01)
Applications
                               POS tagging
                                         STT (Màrquez & Rodríguez 97)

        [pipeline figure: raw text → morphological analysis →
         Viterbi algorithm, using a language model made of lexical
         probs. + contextual probs. → tagged text; the Viterbi step
         performs the disambiguation]

        The Use of Classifiers in Sequential Inference:
        Chunking (Punyakanok & Roth, 00)
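A minimal Viterbi sketch for the disambiguation step (illustrative: in STT the lexical and contextual probabilities come from the acquired decision trees):

    import numpy as np

    def viterbi(words, tags, lex, ctx):
        """lex[t][w] ~ P(w|t), ctx[t_prev][t] ~ P(t|t_prev); returns the best tag path."""
        n, T = len(words), len(tags)
        delta = np.zeros((n, T))                 # best path score ending in tag j
        back = np.zeros((n, T), dtype=int)       # backpointers
        for j, t in enumerate(tags):
            delta[0, j] = lex[t].get(words[0], 1e-9)
        for i in range(1, n):
            for j, t in enumerate(tags):
                scores = [delta[i - 1, k] * ctx[tp][t] for k, tp in enumerate(tags)]
                back[i, j] = int(np.argmax(scores))
                delta[i, j] = max(scores) * lex[t].get(words[i], 1e-9)
        path = [int(np.argmax(delta[n - 1]))]
        for i in range(n - 1, 0, -1):
            path.append(int(back[i, path[-1]]))
        return [tags[j] for j in reversed(path)]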
Applications


       Detection of sequential and
         hierarchical structures


           • Named Entity recognition
           • Clause detection




Conclusions

       Summary/conclusions
      • We have briefly outlined:
         −The ML setting: “supervised learning for
          classification”
         −Three concrete machine learning
          algorithms
          −How to apply them to solve intermediate
           NLP tasks
Conclusions

       Summary/conclusions
      • Any ML algorithm for NLP should be:
         – Robust to noise and outliers
         – Efficient in large feature/example spaces
         – Adaptive to new/changing domains:
           portability, tuning, etc.
         – Able to take advantage of unlabelled
           examples: semi-supervised learning


Conclusions

       Summary/conclusions
      • Statistical and ML-based Natural
        Language Processing is a very active
        and multidisciplinary area of research




Conclusions

      Some current research lines
     • Appropriate learning paradigm for all kind of
       NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME
        (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02).

     • Definition of an adequate (and task-specific)
       feature space: mapping from the input space to a
        high dimensional feature space, kernels, etc.

     • Resolution of complex NLP problems:
        inference with classifiers + constraint satisfaction

     • etc.

Conclusions
                 Bibliography
      • You may find additional information at:
        http://www.lsi.upc.es/~lluism/
          tesi.html
          publicacions/pubs.html
          cursos/talks.html
          cursos/MLandNL.html
          cursos/emnlp1.html


     • This talk at:
       http://www.lsi.upc.es/~lluism/udg03.ppt.gz



More Related Content

Viewers also liked

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Why Now Is The Time For NLP
Why Now Is The Time For NLPWhy Now Is The Time For NLP
Why Now Is The Time For NLPLinda Ferguson
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
neuro-linguistic programming
neuro-linguistic programmingneuro-linguistic programming
neuro-linguistic programmingMichael Buckley
 
Neuro Linguistic Programming
Neuro Linguistic ProgrammingNeuro Linguistic Programming
Neuro Linguistic Programmingsmjk
 
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...Health Catalyst
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA DATASCIENCE
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...Maarten van Wesel
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...Ahmed Mater
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attributionReza Ramezani
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaTraian Rebedea
 

Viewers also liked (20)

NLP for business analysts
NLP for business analystsNLP for business analysts
NLP for business analysts
 
NLP_lectures_English
NLP_lectures_EnglishNLP_lectures_English
NLP_lectures_English
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Deeplearning NLP
Deeplearning NLPDeeplearning NLP
Deeplearning NLP
 
Why Now Is The Time For NLP
Why Now Is The Time For NLPWhy Now Is The Time For NLP
Why Now Is The Time For NLP
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Neuro linguistic programming(nlp)
Neuro linguistic programming(nlp)Neuro linguistic programming(nlp)
Neuro linguistic programming(nlp)
 
neuro-linguistic programming
neuro-linguistic programmingneuro-linguistic programming
neuro-linguistic programming
 
NLP for project managers
NLP for project managersNLP for project managers
NLP for project managers
 
Neuro Linguistic Programming
Neuro Linguistic ProgrammingNeuro Linguistic Programming
Neuro Linguistic Programming
 
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
How to Use Text Analytics in Healthcare to Improve Outcomes: Why You Need Mor...
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Develope yourself nlp
Develope yourself nlpDevelope yourself nlp
Develope yourself nlp
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attribution
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Machine Learning for NLP

  • 1. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003
  • 2. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 3. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 4. ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): Making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal Machine Learning for NLP 30/06/2003
  • 5. ML4NLP Machine Learning • Learning... but what for? – To perform some particular task – To react to environmental inputs – Concept learning from data: • modelling concepts underlying data • predicting unseen observations • compacting the knowledge representation • knowledge discovery for expert systems • We will concentrate on: – Supervised inductive learning for classification = discriminative learning Machine Learning for NLP 30/06/2003
  • 6. ML4NLP Machine Learning A more precise definition: Obtaining a description of the concept in some representation language that explains observations and helps predicting new instances of the same distribution • What to read? – Machine Learning (Mitchell, 1997) Machine Learning for NLP 30/06/2003
  • 7. ML4NLP Empirical NLP 90’s: Application of Machine Learning (ML) techniques to NLP problems • Lexical and structural ambiguity problems (all classification problems): – Word selection (SR, MT) – Part-of-speech tagging – Semantic ambiguity (polysemy) – Prepositional phrase attachment – Reference ambiguity (anaphora) – etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999) Machine Learning for NLP 30/06/2003
  • 8. ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 9. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 10. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 11. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity He was shot in the hand as he chased the robbers in the back street (the slide shows candidate senses for “hand”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 12. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation He was shot in the hand as he chased the robbers in the back street (the slide shows candidate senses for “hand”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 13. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 14. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 15. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 16. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP Machine Learning for NLP 30/06/2003
  • 17. Classification Feature Vector Classification (AI perspective) • An instance is a vector: x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued. • Let X be the space of all possible instances. • Let Y = {y1, …, ym} be the set of categories (or classes). • The goal is to learn an unknown target function, f : X → Y • A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair <x, f(x)> • Let D be the set of all training examples. Machine Learning for NLP 30/06/2003
  • 18. Classification Feature Vector Classification • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x) Machine Learning for NLP 30/06/2003
  • 19. Classification An Example. Training set:
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
    Rules: (COLOR=red) ∧ (SHAPE=circle) → positive; otherwise → negative
    Decision Tree: COLOR = red → test SHAPE (circle → positive, triangle → negative); COLOR = blue → negative
    Machine Learning for NLP 30/06/2003
  • 20. Classification An Example. Same training set:
      Example  SIZE   COLOR  SHAPE     CLASS
      1        small  red    circle    positive
      2        big    red    circle    positive
      3        small  red    triangle  negative
      4        big    blue   circle    negative
    Rules: (SIZE=small) ∧ (SHAPE=circle) → positive; (SIZE=big) ∧ (COLOR=red) → positive; otherwise → negative
    Decision Tree: SIZE = small → test SHAPE (circle → positive, triangle → negative); SIZE = big → test COLOR (red → positive, blue → negative)
    Machine Learning for NLP 30/06/2003
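    To make slides 19-20 concrete, here is a minimal Python sketch (not part of the original slides; all names are illustrative) encoding the toy training set and the two hypotheses, both of which are consistent with the data:

        # Toy data set of slides 19-20: (size, color, shape, class)
        DATA = [
            ("small", "red",  "circle",   "positive"),
            ("big",   "red",  "circle",   "positive"),
            ("small", "red",  "triangle", "negative"),
            ("big",   "blue", "circle",   "negative"),
        ]

        def h1(size, color, shape):        # slide 19: COLOR/SHAPE tree
            return "positive" if color == "red" and shape == "circle" else "negative"

        def h2(size, color, shape):        # slide 20: SIZE-based tree
            if size == "small":
                return "positive" if shape == "circle" else "negative"
            return "positive" if color == "red" else "negative"

        # Both hypotheses fit the training data perfectly; only the
        # inductive bias (next slide) decides which one a learner prefers.
        assert all(h(s, c, sh) == y for (s, c, sh, y) in DATA for h in (h1, h2))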
  • 21. Classification Some important concepts • Inductive Bias “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) – Language / Search bias (the COLOR/SHAPE decision tree from slide 19 is shown as an example) Machine Learning for NLP 30/06/2003
  • 22. Classification Some important concepts • Inductive Bias • Training error and generalization error • Generalization ability and overfitting • Batch learning vs. on-line learning • Symbolic vs. statistical learning • Propositional vs. first-order learning Machine Learning for NLP 30/06/2003
  • 23. Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) → classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z) Machine Learning for NLP 30/06/2003
  • 24. Classification The Classification Setting Class, Point, Example, Data Set, ... (CoLT/SLT perspective) • Input Space: X ⊆ Rn • (binary) Output Space: Y = {+1, −1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ⊆ (X × Y)^m Machine Learning for NLP 30/06/2003
  • 25. Classification The Classification Setting Learning, Error, ... • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form: h(x) = Σ_{i=1..n} w_i φ_i(x) + b • The goal is to find a function h ∈ H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM) Machine Learning for NLP 30/06/2003
  • 26. Classification The Classification Setting Learning, Error, ... • Expected error (risk): R(h) = ∫ loss(h(x), y) dP(x,y) • Problem: P itself is unknown. Known are only the training examples ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h ∈ H for which the training error (empirical risk) is minimal: R_emp(h) = (1/m) Σ_{i=1..m} loss(h(x_i), y_i) Machine Learning for NLP 30/06/2003
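    As an illustration of ERM (a sketch, not from the slides; names are illustrative), the empirical risk under 0/1 loss is just the fraction of misclassified training examples:

        def empirical_risk(h, examples):
            """R_emp(h) = (1/m) * sum of 0/1 losses over (x_i, y_i) pairs."""
            m = len(examples)
            return sum(1 for x, y in examples if h(x) != y) / m

        # ERM then picks, from a (finite) hypothesis space H, the minimiser:
        # h_star = min(H, key=lambda h: empirical_risk(h, train_set))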
  • 27. Classification The Classification Setting Error, Over(under)fitting, ... • Low training error ⇒ low true error? • The overfitting dilemma: underfitting vs. overfitting • Trade-off between training error and complexity • Different learning biases can be used Machine Learning for NLP 30/06/2003
  • 28. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 29. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms −Decision Trees −AdaBoost −Support Vector Machines • Applications to NLP Machine Learning for NLP 30/06/2003
  • 30. Algorithms Learning Paradigms • Statistical learning: – HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) – Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) – Winnow, AdaBoost, SVM’s, etc. Machine Learning for NLP 30/06/2003
  • 31. Algorithms Learning Paradigms • Classifier combination: – Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples – Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc. Machine Learning for NLP 30/06/2003
  • 32. Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes Machine Learning for NLP 30/06/2003
  • 33. Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc. Machine Learning for NLP 30/06/2003
  • 34. Algorithms An Example (two decision-tree figures: a generic n-ary tree branching on attributes A1, A2, A3, A5 over values v1…v7 with leaf classes C1-C3, and the concrete SIZE/SHAPE/COLOR tree from slide 20) Machine Learning for NLP 30/06/2003
  • 35. Algorithms Learning Decision Trees (figure): Training Set + TDIDT → DT (learning); Test Example + DT → Class (classification) Machine Learning for NLP 30/06/2003
  • 36. Algorithms General Induction Algorithm
        function TDIDT (X: set-of-examples; A: set-of-features)
          var tree1, tree2: decision-tree; X’: set-of-examples; A’: set-of-features
          if stopping_criterion(X) then
            tree1 := create_leaf_tree(X)
          else
            amax := feature_selection(X, A)
            tree1 := create_tree(X, amax)
            for-all val in values(amax) do
              X’ := select_examples(X, amax, val)
              A’ := A − {amax}
              tree2 := TDIDT(X’, A’)
              tree1 := add_branch(tree1, tree2, val)
            end-for
          end-if
          return tree1
        end-function
    Machine Learning for NLP 30/06/2003
  • 37. Algorithms General Induction Algorithm (the same TDIDT pseudocode, shown again) Machine Learning for NLP 30/06/2003
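    For readers who want something executable, here is a minimal Python rendering of the TDIDT pseudocode above; stopping_criterion and feature_selection are kept abstract, exactly as on the slide:

        from collections import Counter

        def tdidt(X, A, feature_selection, stopping_criterion):
            """X: list of (features_dict, label); A: set of feature names."""
            if stopping_criterion(X) or not A:
                majority = Counter(y for _, y in X).most_common(1)[0][0]
                return ("leaf", majority)                 # create_leaf_tree
            amax = feature_selection(X, A)                # best feature
            branches = {}
            for val in {x[amax] for x, _ in X}:           # values(amax)
                X_val = [(x, y) for x, y in X if x[amax] == val]
                branches[val] = tdidt(X_val, A - {amax},
                                      feature_selection, stopping_criterion)
            return ("node", amax, branches)               # create_tree + add_branch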
  • 38. Algorithms Feature Selection Criteria • Functions derived from Information Theory: – Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures – Gini Diversity Index (Breiman et al. 84) – RLM (López de Mántaras 91) • Statistically-based – Chi-square test (Sestito & Dillon 94) – Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94) Machine Learning for NLP 30/06/2003
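    As one concrete instance of the criteria listed above, a sketch of Information Gain (Quinlan 86), using the same (features_dict, label) representation as the TDIDT sketch; all names are illustrative:

        import math
        from collections import Counter

        def entropy(labels):
            """Shannon entropy of a non-empty list of class labels."""
            counts, total = Counter(labels), len(labels)
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

        def information_gain(X, a):
            """Entropy reduction obtained by splitting X on feature a."""
            labels = [y for _, y in X]
            split = Counter(x[a] for x, _ in X)
            remainder = sum(
                (n / len(X)) * entropy([y for x, y in X if x[a] == v])
                for v, n in split.items())
            return entropy(labels) - remainder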
  • 39. Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimize the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc. Machine Learning for NLP 30/06/2003
  • 40. Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98) Machine Learning for NLP 30/06/2003
  • 41. Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Benett 95; Mc Carthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97) Machine Learning for NLP 30/06/2003
  • 42. Algorithms Decision Trees: pros&cons • Advantages – Acquires symbolic knowledge in an understandable way – Very well-studied ML algorithm with many variants – Can be easily translated into rules – Existence of available software: C4.5, C5.0, etc. – Can be easily integrated into an ensemble Machine Learning for NLP 30/06/2003
  • 43. Algorithms Decision Trees: pros&cons • Drawbacks – Computationally expensive when scaling to large natural language domains: training examples, features, etc. – Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation – DTs are high-variance (unstable) models – Tendency to overfit training data: pruning is necessary – Requires quite a big effort in tuning the model Machine Learning for NLP 30/06/2003
  • 44. Algorithms Boosting algorithms • Idea: “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively • Many other variants and extensions (1997-2003) http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html Machine Learning for NLP 30/06/2003
  • 45. Algorithms AdaBoost: general scheme (figure): the final classifier is a linear combination F(h1, h2, …, hT) of weak hypotheses; at each round t a Weak Learner is trained on training set TSt under probability distribution Dt, producing ht, and the distribution is then updated for the next round Machine Learning for NLP 30/06/2003
  • 46. Algorithms AdaBoost: algorithm (Freund & Schapire 97) (the algorithm box appears as an image on the slide) Machine Learning for NLP 30/06/2003
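    Since the algorithm box on this slide is an image, here is a minimal sketch of standard binary AdaBoost (Freund & Schapire 97), with labels in {-1, +1} and an abstract weak_learner(examples, D) that returns a hypothesis h; names are illustrative:

        import math

        def adaboost(examples, weak_learner, T):
            """examples: list of (x, y) with y in {-1, +1}."""
            m = len(examples)
            D = [1.0 / m] * m                      # initial distribution D_1
            ensemble = []                          # pairs (alpha_t, h_t)
            for _ in range(T):
                h = weak_learner(examples, D)
                eps = sum(D[i] for i, (x, y) in enumerate(examples) if h(x) != y)
                if eps >= 0.5:                     # no better than chance: stop
                    break
                eps = max(eps, 1e-10)              # avoid log(0) for a perfect h
                alpha = 0.5 * math.log((1 - eps) / eps)
                # Re-weight: misclassified examples gain weight, correct ones lose it
                D = [D[i] * math.exp(-alpha * y * h(x))
                     for i, (x, y) in enumerate(examples)]
                Z = sum(D)                         # normalisation constant
                D = [d / Z for d in D]
                ensemble.append((alpha, h))
            # Combined hypothesis: sign of the weighted vote
            return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1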
  • 47. Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes Machine Learning for NLP 30/06/2003
  • 48. Algorithms AdaBoost: round 1 Machine Learning for NLP 30/06/2003
  • 49. Algorithms AdaBoost: round 2 Machine Learning for NLP 30/06/2003
  • 50. Algorithms AdaBoost: round 3 Machine Learning for NLP 30/06/2003
  • 51. Algorithms Combined Hypothesis www.research.att.com/~yoav/adaboost Machine Learning for NLP 30/06/2003
  • 52. Algorithms AdaBoost and NLP • POS Tagging (Abney et al. 99; Màrquez 99) • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99) • PP-attachment Disambiguation (Abney et al. 99) • Parsing (Haruno et al. 99) • Word Sense Disambiguation (Escudero et al. 00, 01) • Shallow parsing (Carreras & Màrquez, 01a; 02) • Email spam filtering (Carreras & Màrquez, 01b) • Term Extraction (Vivaldi, et al. 01) Machine Learning for NLP 30/06/2003
  • 53. Algorithms AdaBoost: pros&cons + Easy to implement and few parameters to set + Time and space grow linearly with the number of examples. Ability to manage very large learning problems + Does not constrain explicitly the complexity of the learner + Naturally combines feature selection with learning + Has been successfully applied to many practical problems Machine Learning for NLP 30/06/2003
  • 54. Algorithms AdaBoost: pros&cons ± Seems to be rather robust to overfitting (number of rounds) but sensitive to noise ± Performance is very good when there are relatively few relevant terms (features) – Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers: the training errors of the base classifiers become too large too quickly Machine Learning for NLP 30/06/2003
  • 55. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Machine Learning for NLP 30/06/2003
  • 56. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Key Concepts Machine Learning for NLP 30/06/2003
  • 57. Algorithms Linear Classifiers • Hyperplanes in R^N • Defined by a weight vector (w) and a threshold (b) • They induce the classification rule: h(x) = sign(Σ_{i=1..N} w_i x_i + b), i.e. +1 if Σ_{i=1..N} w_i x_i + b ≥ 0 and −1 otherwise (figure: positive and negative points on either side of the hyperplane defined by w and b) Machine Learning for NLP 30/06/2003
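    The rule above, as a two-line sketch (illustrative weights):

        def linear_classifier(w, b):
            # h(x) = sign(w . x + b), with sign(0) taken as +1
            return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

        h = linear_classifier([2.0, -1.0], 0.5)
        print(h([1.0, 1.0]))   # -> 1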
  • 58. Algorithms Optimal Hyperplane: Geometric Intuition Machine Learning for NLP 30/06/2003
  • 59. Algorithms Optimal Hyperplane: Geometric Intuition These are the Support Vectors Maximal Margin Hyperplane Machine Learning for NLP 30/06/2003
  • 60. Algorithms Linearly separable data: the geometric margin is 2/||w||, so maximizing the margin is equivalent to minimizing ||w||² subject to the constraints y_i (w·x_i + b) ≥ 1 for all i = 1, …, l (a Quadratic Programming problem) Machine Learning for NLP 30/06/2003
  • 61. Algorithms Non-separable case (soft margin): introduce positive slack variables ξ_1, …, ξ_l for costing the errors. Minimize ||w||² + C Σ_{i=1..l} ξ_i subject to the constraints y_i (w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, …, l Machine Learning for NLP 30/06/2003
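    The slide poses the soft-margin problem as a QP. As an illustrative alternative (not the method on the slide), the equivalent regularised hinge loss can be approximately minimised by stochastic subgradient descent, Pegasos-style; the sketch below assumes examples as (feature_list, label) pairs with labels in {-1, +1}:

        import random

        def svm_sgd(examples, lam=0.01, epochs=100):
            """Minimise lam/2 * ||w||^2 + mean hinge loss by SGD."""
            data = list(examples)              # avoid mutating the caller's list
            n = len(data[0][0])
            w, b, t = [0.0] * n, 0.0, 0
            for _ in range(epochs):
                random.shuffle(data)
                for x, y in data:
                    t += 1
                    eta = 1.0 / (lam * t)      # decreasing learning rate
                    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
                    w = [wi * (1 - eta * lam) for wi in w]   # regulariser shrink
                    if margin < 1:             # hinge loss active: step towards x
                        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                        b += eta * y
            return w, b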
  • 62. Algorithms Non-linear SVMs • Implicit mapping into feature space via kernel functions: a non-linear mapping φ : X → F • Set of hypotheses: f(x) = Σ_{i=1..n} w_i φ_i(x) + b • Dual formulation: f(x) = Σ_{i=1..l} α_i y_i φ(x_i)·φ(x) + b • Kernel function: K(x, z) = φ(x)·φ(z) • Evaluation: f(x) = Σ_{i=1..l} α_i y_i K(x_i, x) + b Machine Learning for NLP 30/06/2003
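    Once the coefficients α_i and the bias b have been obtained (training itself is not shown here), classification needs only kernel evaluations against the support vectors. A sketch with a degree-3 polynomial kernel, matching slide 64; names are illustrative:

        def poly_kernel(x, z, d=3):
            """K(x, z) = (1 + x . z)^d"""
            return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** d

        def svm_decision(alphas, sv_xs, sv_ys, b, kernel):
            """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, thresholded at 0."""
            def f(x):
                s = sum(a * y * kernel(xi, x)
                        for a, xi, y in zip(alphas, sv_xs, sv_ys)) + b
                return 1 if s >= 0 else -1
            return f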
  • 63. Algorithms Non-linear SVMs • Kernel functions – Must be efficiently computable – Characterization via Mercer’s theorem – One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000) – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc. Machine Learning for NLP 30/06/2003
  • 64. Algorithms Non-linear SVMs. Degree 3 polynomial kernel (figures): linearly separable vs. linearly non-separable data Machine Learning for NLP 30/06/2003
  • 65. Algorithms Toy Examples • All examples have been run with the 2D graphic interface of SVMLIB (Chang and Lin, National University of Taiwan) “LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…” • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web integrated demo tool) Machine Learning for NLP 30/06/2003
  • 66. Algorithms Toy Examples (I) Linearly separable data set. Linear SVM. Maximal margin hyperplane. What happens if we add a blue training example here? Machine Learning for NLP 30/06/2003
  • 67. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM High value of C parameter Maximal margin Hyperplane The example is correctly classified Machine Learning for NLP 30/06/2003
  • 68. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM Low value of C parameter Trade-off between: margin and training error The example is now a bounded SV Machine Learning for NLP 30/06/2003
  • 69. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 70. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 71. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  • 72. Algorithms Toy Examples (III) Machine Learning for NLP 30/06/2003
  • 73. Algorithms SVM: Summary • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+) • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+) • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+) Machine Learning for NLP 30/06/2003
  • 74. Algorithms SVM: Summary • Due to Mercer’s conditions on the kernels the optimisation problems are convex. No local minima (+) • Optimisation theory guides the implementation. Efficient learning (+) • Mainly for classification but also for regression, density estimation, clustering, etc. • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+) • Parameter tuning (–). Implications in convergence times, sparsity of the solution, etc. Machine Learning for NLP 30/06/2003
  • 75. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  • 76. Applications NLP problems • Warning! We will not focus on final NLP applications, but on intermediate tasks... • We will classify the NLP tasks according to their (structural) complexity Machine Learning for NLP 30/06/2003
  • 77. Applications NLP problems: structural complexity • Decisional problems − Text Categorization, Document filtering, Word Sense Disambiguation, etc. • Sequence tagging and detection of sequential structures − POS tagging, Named Entity extraction, syntactic chunking, etc. • Hierarchical structures − Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc. Machine Learning for NLP 30/06/2003
  • 78. Applications POS tagging • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street (the slide shows candidate tags JJ, NN and VB over the ambiguous words) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  • 79. Applications POS tagging: the “preposition-adverb” tree (figure): a decision tree for disambiguating IN vs. RB; root probabilities P(IN)=0.81, P(RB)=0.19; branching on Word Form (“As”/“as” vs. others: P(IN)=0.83, P(RB)=0.17), then tag(+1) (RB vs. others: P(IN)=0.13, P(RB)=0.87), then tag(+2) (IN), reaching a leaf with P(IN)=0.013, P(RB)=0.987. Probabilistic interpretation: estimated P(RB | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987 and estimated P(IN | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013 Machine Learning for NLP 30/06/2003
  • 80. Applications POS tagging: the same “preposition-adverb” tree (figure), now highlighting the collocations it captures: “as_RB much_RB as_IN”, “as_RB soon_RB as_IN”, “as_RB well_RB as_IN” Machine Learning for NLP 30/06/2003
  • 81. Applications POS tagging RTT (Màrquez & Rodríguez 97) (figure): raw text → morphological analysis → disambiguation loop [Classify → Update → Filter], repeated until the stop condition holds → tagged text, driven by a Language Model. Related: A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01) Machine Learning for NLP 30/06/2003
  • 82. Applications POS tagging STT (Màrquez & Rodríguez 97) (figure): the Language Model combines lexical probabilities and contextual probabilities; raw text → morphological analysis → Viterbi algorithm (disambiguation) → tagged text. Related: The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00) Machine Learning for NLP 30/06/2003
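    A minimal sketch of the Viterbi step in the STT scheme above; the lexical probabilities P(word|tag), contextual bigram probabilities P(tag|previous tag) and initial tag probabilities are assumed to be given as dictionaries (all names illustrative):

        def viterbi(words, tags, lex_prob, ctx_prob, init_prob):
            """lex_prob[t][w] = P(w|t); ctx_prob[p][t] = P(t|previous tag p);
            init_prob[t] = P(t at sentence start)."""
            # delta[t]: probability of the best tag sequence ending in tag t
            delta = {t: init_prob[t] * lex_prob[t].get(words[0], 1e-9) for t in tags}
            backptrs = []
            for w in words[1:]:
                new_delta, ptr = {}, {}
                for t in tags:
                    prev, score = max(((p, delta[p] * ctx_prob[p][t]) for p in tags),
                                      key=lambda pair: pair[1])
                    new_delta[t], ptr[t] = score * lex_prob[t].get(w, 1e-9), prev
                delta = new_delta
                backptrs.append(ptr)
            # Follow the back-pointers from the best final tag
            seq = [max(delta, key=delta.get)]
            for ptr in reversed(backptrs):
                seq.append(ptr[seq[-1]])
            return list(reversed(seq))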
  • 83. Applications Detection of sequential and hierarchical structures • Named Entity recognition • Clause detection Machine Learning for NLP 30/06/2003
  • 84. Conclusions Summary/conclusions • We have briefly outlined: − The ML setting: “supervised learning for classification” − Three concrete machine learning algorithms − How to apply them to solve intermediate NLP tasks Machine Learning for NLP 30/06/2003
  • 85. Conclusions Summary/conclusions • Any ML algorithm for NLP should be: – Robust to noise and outliers – Efficient in large feature/example spaces – Adaptive to new/changing domains: portability, tuning, etc. – Able to take advantage of unlabelled examples: semi-supervised learning Machine Learning for NLP 30/06/2003
  • 86. Conclusions Summary/conclusions • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research Machine Learning for NLP 30/06/2003
  • 87. Conclusions Some current research lines • Appropriate learning paradigms for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02). • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc. • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction • etc. Machine Learning for NLP 30/06/2003
  • 88. Conclusions Bibliography • You may find additional information at: http://www.lsi.upc.es/~lluism/ tesi.html publicacions/pubs.html cursos/talks.html cursos/MLandNL.html cursos/emnlp1.html • This talk at: http://www.lsi.upc.es/~lluism/udg03.ppt.gz Machine Learning for NLP 30/06/2003
  • 89. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003