Separation of Lanthanides/ Lanthanides and Actinides
PhD Dissertation on Document Classification Models
1. PhD Dissertation
Document Classification Models TC
Notation
Representation
based on Bayesian networks Particularities
Evaluation
Alfonso E. Romero Bayesian networks
Definition
Storage problems
Canonical models
April 27, 2010
OR Gate
Models
Experiments
Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
Structured
Structured documents
Advisors: Luis M. de Campos, and Transformations
Experiments
Juan M. Fernández-Luna Link-based categorization
Multiclass model
Department of Computer Science and A.I. Experiments
Multilabel model
University of Granada Experiments
Remarks
PhD Dissertation.1
2. A brief overview: the problem I
We shall solve problems in document categorization...
TC
Notation
Representation
Particularities
Evaluation
New HSL New HSL
Madrid-Valencia Madrid-Valencia Bayesian networks
to open in
December, 2010 ? to open in
December, 2010
Definition
Storage problems
Canonical models
National OR Gate
Models
Experiments
Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
Structured
Structured documents
Sports Society World National Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.2
3. A brief overview: the problem II
... in particular, automatic document categorization...
TC
Notation
Representation
Particularities
Evaluation
New HSL New HSL
Madrid-Valencia Madrid-Valencia Bayesian networks
to open in
December, 2010 ? to open in
December, 2010
Definition
Storage problems
Canonical models
National OR Gate
Models
Experiments
Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
Structured
Structured documents
Sports Society World National Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.3
4. A brief overview: the problem III
... concretely, supervised document categorization.
TC
Notation
Representation
Particularities
The Alhambra of Learning procedure Evaluation
Granada, most
visited
National Bayesian networks
monument in 2009 Definition
Storage problems
Canonical models
Prime minister OR Gate
Zapatero to give a Algorithm (classifier) Models
speech on the National Experiments
Parliament
Thesaurus
Definitions
Unsupervised model
AVE train to run at Supervised model
350 km/h in 2010 Experiments
from Madrid to National
Structured
Barcelona Structured documents
Transformations
Experiments
Labeled corpus
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.4
5. A brief overview: the methods I
Which learning method shall we use?
TC
Notation
Representation
Neural networks
Particularities
Evaluation
Support Vector Machines
Bayesian networks
Definition
Storage problems
k-NN�methods
Canonical models
OR Gate
Bayesian networks Models
Experiments
and probabilistic methods
Thesaurus
Definitions
Unsupervised model
Decision trees
Supervised model
Experiments
Structured
Evolutive algorithms Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.5
6. A brief overview: the methods II
The answer:
TC
Notation
Representation
Neural networks
Particularities
Evaluation
Support Vector Machines
Bayesian networks
Definition
Storage problems
k-NN�methods
Canonical models
OR Gate
Bayesian networks Models
Experiments
and probabilistic methods
Thesaurus
Definitions
Unsupervised model
Decision trees
Supervised model
Experiments
Evolutive algorithms
Structured
Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.6
7. A brief overview: the methods III
TC
But, why? Notation
Representation
Particularities
Evaluation
Bayesian networks
• Strong theoretical foundation (probability theory). Definition
Storage problems
Canonical models
OR Gate
• Models for (uncertain) knowledge representation. Models
Experiments
Thesaurus
Definitions
• Great success in related tasks (IR). Unsupervised model
Supervised model
Experiments
Structured
• Our background at the group UTAI. Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.7
8. Outline
1 Text Categorization
TC
Notation
Representation
Particularities
2 Bayesian networks Evaluation
Bayesian networks
Definition
Storage problems
3 An OR Gate-Based Text Classifier Canonical models
OR Gate
Models
Experiments
4 Automatic Indexing From a Thesaurus Using
Thesaurus
Bayesian Networks Definitions
Unsupervised model
Supervised model
Experiments
5 Structured Document Categorization Using Bayesian Structured
Structured documents
Networks Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
6 Final Remarks Multilabel model
Experiments
Remarks
PhD Dissertation.8
9. Outline
⇒ Text Categorization
TC
Notation
Representation
Particularities
2 Bayesian networks Evaluation
Bayesian networks
Definition
Storage problems
3 An OR Gate-Based Text Classifier Canonical models
OR Gate
Models
Experiments
4 Automatic Indexing From a Thesaurus Using
Thesaurus
Bayesian Networks Definitions
Unsupervised model
Supervised model
Experiments
5 Structured Document Categorization Using Bayesian Structured
Structured documents
Networks Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
6 Final Remarks Multilabel model
Experiments
Remarks
PhD Dissertation.9
10. Supervised Text Categorization I
Provided
TC
Notation
1 Set of labeled documents DTr (training). Representation
Particularities
2 C, set of categories/labels. Evaluation
Bayesian networks
The goal is to build a model f (classifier) capable of Definition
Storage problems
predicting categories (of C) of documents in D. Canonical models
OR Gate
Models
Different kinds of labeling: Experiments
Thesaurus
• f : D → {c, c} (binary). Definitions
Unsupervised model
• f : D → {c1 , c2 , . . . , cn } (multiclass). Supervised model
Experiments
• f : D × C → {0, 1} (multilabel). Structured
Structured documents
Transformations
A multilabel problem reduces to |C| binary problems Experiments
Link-based categorization
C = {c, c}. We often change the codomain from {0, 1} Multiclass model
Experiments
(hard classification) to [0, 1] (soft classification). Multilabel model
Experiments
Remarks
PhD Dissertation.10
11. Supervised Text Categorization II
Document representation:
As in Information Retrieval
TC
Stopwords removal + stemming + Vector Notation
representation (Frequency, binary or tf-idf). Representation
Particularities
Evaluation
From (preprocessed) document to vector Bayesian networks
Definition
Storage problems
Canonical models
term ⇔ dimension OR Gate
Models
Experiments
Thesaurus
Definitions
Unsupervised model
Example (beginning of John Milton's "Lost Paradise"): Supervised model
Experiments
Of Mans First Disobedience, and the Fruit Of Mans First Disobedience, and the Fruit
Structured
Of that Forbidden Tree, whose mortal tast Of that Forbidden Tree, whose mortal tast
Structured documents
Brought Death into the World, and all our woe, Brought Death into the World, and all our woe,
With loss of Eden, till one greater Man... With loss of Eden, till one greater Man... Transformations
Experiments
Link-based categorization
Multiclass model
2 1 1 1 1 1 1 1 1 1 1 1 1 1
Experiments
man obedience fruit forbid tree mortal tast bring death world woe loss eden great Multilabel model
Experiments
Remarks
PhD Dissertation.11
12. Supervised Text Categorization III
What are the particularities of the problem?
It differs from a “classic” Machine Learning problem in: TC
Notation
Representation
Particularities
Evaluation
• High dimensionality (easily > 10000).
Bayesian networks
Definition
Storage problems
Canonical models
• Very unbalanced datasets.
OR Gate
Models
Experiments
• |C| 0. Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
• Sometimes, there is a hierarchy in the set C.
Structured
Structured documents
Transformations
Experiments
• Sometimes, explicit relationships among Link-based categorization
Multiclass model
documents are given. Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.12
13. Supervised Text Categorization IV
Evaluation
How to measure the correctness of documents
TC
assigned to set of categories? Notation
Representation
• Binary/multiclass: Particularities
Evaluation
TP Bayesian networks
• Hard categorization Precision: TP+FP and Recall: Definition
Storage problems
TP 2PR
TP+FN , F1 : P+R . Canonical models
OR Gate
• Soft categorization Precision/Recall BEP. Models
Experiments
Thesaurus
Definitions
• Multilabel: micro and macro averages. Unsupervised model
Supervised model
Experiments
Structured
• Also, average precision on the 11 std. recall points Structured documents
Transformations
(category ranking). Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
• Standard corpora: Reuters, Ohsumed, 20 NG. . . Experiments
Remarks
PhD Dissertation.13
14. Outline
1 Text Categorization
TC
Notation
Representation
⇒ Bayesian networks Particularities
Evaluation
Bayesian networks
Definition
Storage problems
3 An OR Gate-Based Text Classifier Canonical models
OR Gate
Models
Experiments
4 Automatic Indexing From a Thesaurus Using
Thesaurus
Bayesian Networks Definitions
Unsupervised model
Supervised model
Experiments
5 Structured Document Categorization Using Bayesian Structured
Structured documents
Networks Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
6 Final Remarks Multilabel model
Experiments
Remarks
PhD Dissertation.14
15. Bayesian networks I
Definition and characteristics
A set of random variables X1 , . . . , XN in a DAG, verifying
P(X1 , . . . , Xn ) = n P(Xi |Pa(Xi )) ⇒ the graph
i=1 TC
Notation
represents independences. Representation
Particularities
Evaluation
Causal interpretation. Bayesian networks
Definition
Storage problems
Canonical models
Learning and inference methods available. OR Gate
Models
Experiments
Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
Structured
Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.15
16. Bayesian networks III
Estimation and storage problems
TC
The problem Notation
Representation
Particularities
• One value for each configuration of the parents. Evaluation
Bayesian networks
• General case: exponential number of parameters Definition
Storage problems
on the number of parents. Canonical models
OR Gate
Models
Experiments
The solution Thesaurus
Definitions
1 Few parents per node (not realistic in text). Unsupervised model
Supervised model
Experiments
2 Write the probability of a node as a deterministic Structured
function of the configuration (canonical model). Set Structured documents
Transformations
of parameters with linear size on the number of Experiments
Link-based categorization
parents. Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.16
17. Bayesian networks: canonical models
Components and examples of canonical models
• X = {Xi }: parents (causes), Y : child (effect). TC
Notation
Representation
• Xi in {xi , xi }, Yi in {yi , yi } (occurrence or not). Particularities
Evaluation
Bayesian networks
1 Noisy-OR gate model: Definition
p(y |X) = 1 − Xi ∈R(x) (1 − wOR (Xi , Y )). Storage problems
Canonical models
2 Additive model: p(y |X) = Xi ∈R(x) wadd (Xi , Y ).
OR Gate
Models
Experiments
Thesaurus
...
Definitions
X1 X2 XL
Unsupervised model
Supervised model
Experiments
Structured
Structured documents
Transformations
Experiments
Y Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.17
18. Outline
1 Text Categorization
TC
Notation
Representation
Particularities
2 Bayesian networks Evaluation
Bayesian networks
Definition
⇒ An OR Gate-Based Text Classifier Storage problems
Canonical models
OR Gate
Models
Experiments
4 Automatic Indexing From a Thesaurus Using
Thesaurus
Bayesian Networks Definitions
Unsupervised model
Supervised model
Experiments
5 Structured Document Categorization Using Bayesian Structured
Structured documents
Networks Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
6 Final Remarks Multilabel model
Experiments
Remarks
PhD Dissertation.18
19. An OR Gate-Based Text Classifier
Why using OR gates?
• The OR gate is a simple model (and fast for TC
Notation
inference). Representation
Particularities
• It has gained great success in knowledge Evaluation
Bayesian networks
representation. Definition
Storage problems
• Discriminative classifier (models directly p(ci |dj )) Canonical models
(NB generative). OR Gate
Models
• Seems natural the term are causes and category Experiments
Thesaurus
the effect. Definitions
Unsupervised model
Supervised model
Experiments
C C Structured
Structured documents
Transformations
Experiments
Link-based categorization
T1 T2 ... Tn T1 T2 ... Tn Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.19
20. An OR Gate-Based Text Classifier
The model
Equations for the OR gate classifier TC
Notation
Probability distributions: Representation
Particularities
Evaluation
pi (ci |pa(Ci )) = 1 − (1 − w(Tk , Ci )) , Bayesian networks
Definition
Tk ∈R(pa(Ci )) Storage problems
Canonical models
pi (c i |pa(Ci )) = 1 − pi (ci |pa(Ci )). OR Gate
Models
Experiments
Inference: Thesaurus
Definitions
Unsupervised model
pi (ci |dj ) = 1 − 1 − w(Tk , Ci ) p(tk |dj ) Supervised model
Experiments
Tk ∈Pa(Ci )
Structured
Structured documents
= 1− (1 − w(Tk , Ci )) . Transformations
Experiments
Tk ∈Pa(Ci )∩dj Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Model characterized by the w(Tk , Ci ) formula. Remarks
PhD Dissertation.20
21. An OR Gate-Based Text Classifier
Weight estimation formulae
Weight w(Tk , Ci ) means TC
Notation
Representation
ˆ
pi (ci |tk , t h ∀Th ∈ Pa(Ci ), Th = Tk ). Particularities
Evaluation
Bayesian networks
ˆ Nik +1
1 ML: w(Tk , Ci ) as p(ci |tk ), w(Tk , Ci ) = N•k +2 (using Definition
Storage problems
Laplace). Canonical models
OR Gate
Models
Experiments
2 TI: assuming term independence, given the Thesaurus
(Ni• −N
category, w(Tk , Ci ) = ntNik•k h=k (N−N•hih )N .
Definitions
Unsupervised model
iN )Ni• Supervised model
Experiments
Structured
Notation: N number of words in the training documents. Structured documents
Transformations
N•k times that term tk appears in training documents Experiments
Link-based categorization
(N•k = ci Nik ), Ni• is the total number of words in Multiclass model
Experiments
documents of class ci (Ni• = tk Nik ), M vocabulary size. Multilabel model
Experiments
Remarks
PhD Dissertation.21
22. An OR Gate-Based Text Classifier
Pruning independent terms
TC
Notation
Representation
Particularities
• The model can be improved pruning terms which Evaluation
are independent with the category. Bayesian networks
Definition
Storage problems
Canonical models
• We run an independence test for each pair OR Gate
Models
term/category, at a certain confidence level. Only Experiments
terms which pass it are kept. Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
• The size of the parent set is highly reduced, but Structured
classification is often improved. Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.22
23. An OR Gate-Based Text Classifier
Experiments I: Experimental setting
TC
Notation
Representation
Particularities
• We compare a Multinomial NB, a Rocchio, OR TI, Evaluation
OR ML, and both OR with pruning at levels Bayesian networks
Definition
{0.9, 0.99, 0.999}. Storage problems
Canonical models
OR Gate
Models
• We made experiments on Ohsumed-23, Reuters Experiments
and 20 NG. Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
• Soft categorization, evaluated with macro/micro Structured
BEP, 11 avg std prec, and macro/micro F1 @{1, 3, 5}. Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.23
24. An OR Gate-Based Text Classifier
Experiments II: Experimental setting
TC
Notation
Representation
Results (# of wins on 9 measures): Particularities
Evaluation
Bayesian networks
Definition
• Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 Storage problems
Canonical models
(1). OR Gate
Models
Experiments
Thesaurus
• Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1). Definitions
Unsupervised model
Supervised model
Experiments
• 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB Structured
(4). Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.24
25. An OR Gate-Based Text Classifier
Conclusions TC
Notation
Representation
• A new text categorization model, based on noisy OR Particularities
Evaluation
gates. Bayesian networks
Definition
• Simple and computationally affordable. Storage problems
Canonical models
• Results improves Naïve Bayes noticeably. OR Gate
Models
Experiments
Thesaurus
Future work Definitions
Unsupervised model
Supervised model
• Use more advanced noisy OR models (leaky). Experiments
Structured
• Use another canonical models. Structured documents
Transformations
• Explore another alternatives for term pruning. Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.25
26. Outline
1 Text Categorization
TC
Notation
Representation
Particularities
2 Bayesian networks Evaluation
Bayesian networks
Definition
Storage problems
3 An OR Gate-Based Text Classifier Canonical models
OR Gate
Models
Experiments
⇒ Automatic Indexing From a Thesaurus Using
Thesaurus
Bayesian Networks Definitions
Unsupervised model
Supervised model
Experiments
5 Structured Document Categorization Using Bayesian Structured
Structured documents
Networks Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
6 Final Remarks Multilabel model
Experiments
Remarks
PhD Dissertation.26
27. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Thesauri I
Definitions
TC
A thesaurus Notation
Representation
A list of terms representing concepts, grouped those with Particularities
Evaluation
the same meaning, with hierarchical relationships among Bayesian networks
them. Definition
Storage problems
Canonical models
ND:psychiatric
OR Gate
ND:outpatient ND:clinic ND:hospital
hospital ND:health ND:dispensary Models
clinic
care centre Experiments
Thesaurus
Definitions
D:psychiatric D:medical D:medical
ND:medical Unsupervised model
institution institution centre
service Supervised model
Experiments
Structured
ND:health ND:health D:health
Structured documents
protection service
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
D:health
Experiments
policy
Remarks
PhD Dissertation.27
28. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Thesauri II
Indexing with a thesauri
TC
• Problem: associating descriptors (keywords) to Notation
Representation
documents (scientific, medical, legal,. . . ). Particularities
Evaluation
• Manually: expensive and time consuming work (due Bayesian networks
to thousand of descriptors! [EUROVOC > 6000]). Definition
Storage problems
Besides: Canonical models
OR Gate
• How many descriptors should we assign? Models
• Which descriptor should assign in the hierarchy? Experiments
Thesaurus
Definitions
We propose an automatic solution Unsupervised model
Supervised model
Experiments
1 TC problem (categories ≡ descriptors). Structured
Structured documents
2 Makes use of the meta-information of the thesaurus. Transformations
Experiments
Link-based categorization
3 Unsupervised and supervised case. Multiclass model
Experiments
4 Based on BNs. Multilabel model
Experiments
Remarks
PhD Dissertation.28
29. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case I
From the thesaurus...
TC
Notation
Representation
Particularities
Evaluation
ND:psychiatric ND:outpatient ND:clinic ND:hospital
hospital ND:health ND:dispensary
clinic
care centre Bayesian networks
Definition
Storage problems
D:psychiatric D:medical D:medical
ND:medical centre Canonical models
institution institution
service
OR Gate
Models
ND:health ND:health D:health
protection service Experiments
Thesaurus
Definitions
Unsupervised model
D:health Supervised model
policy
Experiments
Structured
Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.29
30. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case II
... to the Bayesian network.
TC
policy protection service health psychiatric outpatient hospital clinic institution medical care centre dispensary Notation
Representation
Particularities
Evaluation
ND:psychiatric ND:outpatient ND:clinic ND:hospital
hospital ND:health ND:dispensary
clinic
care centre Bayesian networks
Definition
Storage problems
D:psychiatric D:medical D:medical
ND:medical Canonical models
institution institution centre
service
OR Gate
Models
ND:health ND:health D:health
protection service Experiments
Thesaurus
Definitions
Unsupervised model
D:health Supervised model
policy
Experiments
Structured
Structured documents
Transformations
• Binary variables (relevant/not relevant), for each Experiments
Link-based categorization
term, descriptor and non descriptor. Multiclass model
Experiments
• Problem: information of different nature mixed in Multilabel model
Experiments
descriptor nodes! Remarks
PhD Dissertation.30
31. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case III
Introducing concept and virtual nodes
TC
protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric
Notation
Representation
Particularities
ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric
protection policy service service care centre centre institution clinic institution hospital Evaluation
Bayesian networks
E:health E:health E:medical E:medical E:psychiatric Definition
policy service centre institution institution
Storage problems
Canonical models
C:medical C:medical C:psychiatric
centre institution institution OR Gate
Models
Experiments
H:health
service
C:health Thesaurus
service
Definitions
Unsupervised model
H:health
policy Supervised model
Experiments
C:health
policy
Structured
Structured documents
Transformations
Experiments
• New concept nodes, C. Link-based categorization
Multiclass model
• New Hierarchy nodes, HC . Experiments
Multilabel model
Experiments
• New Equivalence nodes, EC . Remarks
PhD Dissertation.31
32. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case IV
Probability distributions
We already have the structure, but we have to specify TC
Notation
the distributions on each node. Representation
Particularities
Many parents ⇒ canonical models. Evaluation
Bayesian networks
• D and ND nodes (SUM): Definition
Storage problems
p(d + |pa(D)) = T ∈R(pa(D)) w(T , D).
Canonical models
OR Gate
• HC nodes (SUM): Models
+ Experiments
p(hc |pa(HC )) = C ∈R(pa(HC )) w(C , HC ). Thesaurus
• EC nodes (OR): Definitions
Unsupervised model
+
p(ec |pa(EC )) = 1 − D∈R(pa(EC )) (1 − w(D, C)).
Supervised model
Experiments
• C nodes (OR): Structured
Structured documents
Transformations
+ +
8
> 1 − (1 − w(EC , C))(1 − w(HC , C))
> if ec = ec , hc = hc Experiments
>
< w(E , C) + − Link-based categorization
+ C if ec = ec , hc = hc
p(c |{ec , hc }) = − + Multiclass model
> w(HC , C)
> if ec = ec , hc = hc Experiments
> − −
0 if ec = ec , hc = hc
:
Multilabel model
Experiments
Remarks
PhD Dissertation.32
33. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case V
Inference
TC
Classification is seen as inference. Notation
Representation
Particularities
• Given a document d, we set the term variables Evaluation
T ∈ d to t + , t − otherwise. Bayesian networks
Definition
Storage problems
Canonical models
• Exact propagation is carried out: OR Gate
Models
• First, to descriptor nodes. Experiments
Thesaurus
Definitions
• Then, to EC nodes. Unsupervised model
Supervised model
Experiments
• Following a topological order, probabilities Structured
+
p(hc |pa(HC )) are computed after their parents Structured documents
Transformations
+
p(c |pa(C)) are set. Experiments
Link-based categorization
Multiclass model
Experiments
• Final p(c + |pa(C)), ∀C values are returned. Multilabel model
Experiments
Remarks
PhD Dissertation.33
34. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Supervised model I
Changes from the unsupervised case
TC
Notation
Representation
Particularities
Evaluation
• Concept also receives information from labeled Bayesian networks
Definition
documents in the training set. Storage problems
Canonical models
OR Gate
Models
• We add a training node TC as new parent of the Experiments
concept one. The node is an OR gate ML classifier. Thesaurus
Definitions
Unsupervised model
Supervised model
Experiments
• Distributions of C nodes are modified Structured
consequently. Structured documents
Transformations
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.34
35. Automatic Indexing From a Thesaurus Using Bayesian
Networks: Supervised model II
Graphically: unsupervised
TC
Notation
Representation
Particularities
Evaluation
protection policy health service care dispensary centre medical outpatient clinic hospital institution psychiatric
Bayesian networks
Definition
ND:health ND:health D:health ND:medical D:health ND:health ND:dispensary D:medical D:medical ND:outpatient ND:clinic ND:hospital D:psychiatric ND:psychiatric
protection policy service service care centre centre institution clinic institution hospital Storage problems
Canonical models
E:health
policy
E:health
service
E:medical
centre
E:medical
institution
E:psychiatric OR Gate
institution
Models
Experiments
C:medical C:medical C:psychiatric
centre institution institution
Thesaurus
Definitions
H:health Unsupervised model
service
Supervised model
C:health
service Experiments
H:health Structured
policy
Structured documents
C:health Transformations
policy
Experiments
Link-based categorization
Multiclass model
Experiments
Multilabel model
Experiments
Remarks
PhD Dissertation.35