PhD Dissertation
Document Classification Models based on Bayesian networks

Alfonso E. Romero
April 27, 2010

Advisors: Luis M. de Campos and Juan M. Fernández-Luna
Department of Computer Science and A.I.
University of Granada

Contents (navigation sidebar shown on every slide):
   TC: Notation, Representation, Particularities, Evaluation
   Bayesian networks: Definition, Storage problems, Canonical models
   OR Gate: Models, Experiments
   Thesaurus: Definitions, Unsupervised model, Supervised model, Experiments
   Structured: Structured documents, Transformations, Experiments, Link-based categorization, Multiclass model, Experiments, Multilabel model, Experiments
   Remarks

A brief overview: the problem I


 We shall solve problems in document categorization...

 [Figure: a news item, "New HSL Madrid-Valencia to open in December, 2010", has to be assigned to one of the categories Sports, Society, World, National; it is assigned to National.]

A brief overview: the problem II


 ... in particular, automatic document categorization...

 [Figure: the news item "New HSL Madrid-Valencia to open in December, 2010" is assigned to National among the categories Sports, Society, World, National.]

A brief overview: the problem III


 ... concretely, supervised document categorization.

 [Figure: a labeled corpus is fed to a learning procedure, which produces an algorithm (classifier). The labeled corpus contains three news items, each labeled National:
    • "The Alhambra of Granada, most visited monument in 2009"
    • "Prime minister Zapatero to give a speech in the Parliament"
    • "AVE train to run at 350 km/h in 2010 from Madrid to Barcelona"]

A brief overview: the methods I


 Which learning method shall we use?

    • Neural networks
    • Support Vector Machines
    • k-NN methods
    • Bayesian networks and probabilistic methods
    • Decision trees
    • Evolutionary algorithms

A brief overview: the methods II


 The answer:

    • Bayesian networks and probabilistic methods

 (among the candidates: neural networks, Support Vector Machines, k-NN methods, decision trees and evolutionary algorithms).

A brief overview: the methods III


 But, why?

    • Strong theoretical foundation (probability theory).
    • Models for (uncertain) knowledge representation.
    • Great success in related tasks (Information Retrieval, IR).
    • Our background at the UTAI group.

Outline


  1   Text Categorization
  2   Bayesian networks
  3   An OR Gate-Based Text Classifier
  4   Automatic Indexing From a Thesaurus Using Bayesian Networks
  5   Structured Document Categorization Using Bayesian Networks
  6   Final Remarks

Outline


  ⇒   Text Categorization
  2   Bayesian networks
  3   An OR Gate-Based Text Classifier
  4   Automatic Indexing From a Thesaurus Using Bayesian Networks
  5   Structured Document Categorization Using Bayesian Networks
  6   Final Remarks

Supervised Text Categorization I


 Provided

    1   A set of labeled documents $D_{Tr}$ (training).
    2   $C$, the set of categories/labels.

 The goal is to build a model $f$ (classifier) capable of predicting the categories (of $C$) of documents in $D$.

 Different kinds of labeling:

    • $f : D \to \{c, \bar{c}\}$ (binary).
    • $f : D \to \{c_1, c_2, \ldots, c_n\}$ (multiclass).
    • $f : D \times C \to \{0, 1\}$ (multilabel).

 A multilabel problem reduces to $|C|$ binary problems, one per category $c \in C$, each with label set $\{c, \bar{c}\}$. We often change the codomain from $\{0, 1\}$ (hard classification) to $[0, 1]$ (soft classification).

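 The reduction of a multilabel problem to |C| binary problems, and the step from soft to hard classification, can be sketched as follows. This is only an illustration under stated assumptions: `keyword_scorer` is a hypothetical stand-in for a trained binary classifier, and the keyword sets and the 0.5 threshold are invented for the example, not taken from the thesis.

```python
import re
from typing import Callable, Dict, Set

# A soft binary classifier maps a document to a score in [0, 1].
SoftBinary = Callable[[str], float]


def keyword_scorer(keywords: Set[str]) -> SoftBinary:
    """Hypothetical stand-in for a trained per-category classifier:
    the score is the fraction of the category's keywords found in the text."""
    def score(document: str) -> float:
        words = set(re.findall(r"[a-z]+", document.lower()))
        return len(words & keywords) / len(keywords)
    return score


def multilabel_soft(document: str, classifiers: Dict[str, SoftBinary]) -> Dict[str, float]:
    """Soft multilabel output: one independent binary problem per category."""
    return {category: f(document) for category, f in classifiers.items()}


def multilabel_hard(document: str, classifiers: Dict[str, SoftBinary],
                    threshold: float = 0.5) -> Dict[str, int]:
    """Hard multilabel output f : D x C -> {0, 1}, obtained by thresholding."""
    soft = multilabel_soft(document, classifiers)
    return {category: int(score >= threshold) for category, score in soft.items()}


# Illustrative keyword sets (assumptions made for this example only).
CLASSIFIERS = {
    "National": keyword_scorer({"madrid", "valencia"}),
    "Sports": keyword_scorer({"match", "league"}),
}
DOC = "New HSL Madrid-Valencia to open in December, 2010"

print(multilabel_soft(DOC, CLASSIFIERS))
print(multilabel_hard(DOC, CLASSIFIERS))
```

 Each category is decided independently of the others, which is what makes the reduction to binary problems possible; the threshold is the only link between the soft and the hard output.
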
Supervised Text Categorization II
Document representation:

 As in Information Retrieval:

    Stopword removal + stemming + vector representation (frequency, binary or tf-idf).

 From (preprocessed) document to vector:

    term ⇔ dimension

 Example (beginning of John Milton's "Paradise Lost"):

    Of Mans First Disobedience, and the Fruit
    Of that Forbidden Tree, whose mortal tast
    Brought Death into the World, and all our woe,
    With loss of Eden, till one greater Man...

 Resulting frequency vector:

    term:       man  obedience  fruit  forbid  tree  mortal  tast  bring  death  world  woe  loss  eden  great
    frequency:    2          1      1       1     1       1     1      1      1      1    1     1     1      1

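 The pipeline above (stopword removal, stemming, one dimension per term) can be sketched in a few lines. The stopword list and the `TOY_STEMS` lookup table are assumptions chosen only so that this toy example matches the terms on the slide; a real system would use a full stopword list and a proper stemmer.

```python
import re
from collections import Counter
from typing import List

# Illustrative stopword list (an assumption, not the list used in the thesis).
STOPWORDS = {"of", "and", "the", "that", "whose", "into", "all", "our",
             "with", "till", "one", "first"}

# Toy lookup table standing in for a real stemmer (hypothetical).
TOY_STEMS = {"mans": "man", "disobedience": "obedience", "forbidden": "forbid",
             "brought": "bring", "greater": "great"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords and normalize the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [TOY_STEMS.get(t, t) for t in tokens if t not in STOPWORDS]


def frequency_vector(text: str, vocabulary: List[str]) -> List[int]:
    """One dimension per vocabulary term, holding the term frequency."""
    counts = Counter(preprocess(text))
    return [counts[term] for term in vocabulary]


def binary_vector(text: str, vocabulary: List[str]) -> List[int]:
    """One dimension per vocabulary term: 1 if the term occurs, else 0."""
    return [1 if c > 0 else 0 for c in frequency_vector(text, vocabulary)]


TEXT = """Of Mans First Disobedience, and the Fruit
Of that Forbidden Tree, whose mortal tast
Brought Death into the World, and all our woe,
With loss of Eden, till one greater Man"""

VOCABULARY = ["man", "obedience", "fruit", "forbid", "tree", "mortal", "tast",
              "bring", "death", "world", "woe", "loss", "eden", "great"]

print(frequency_vector(TEXT, VOCABULARY))
print(binary_vector(TEXT, VOCABULARY))
```

 The frequency and binary vectors differ only in the first dimension here, because "man" is the only term that occurs twice.
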
Supervised Text Categorization III
What are the particularities of the problem?


 It differs from a “classic” Machine Learning problem in:

    • High dimensionality (easily > 10000).
    • Very unbalanced datasets.
    • $|C| \gg 0$ (a large number of categories).
    • Sometimes, there is a hierarchy in the set $C$.
    • Sometimes, explicit relationships among documents are given.

Supervised Text Categorization IV
Evaluation

 How do we measure the correctness of the categories assigned to documents?

    • Binary/multiclass:
         • Hard categorization: Precision $P = \frac{TP}{TP+FP}$, Recall $R = \frac{TP}{TP+FN}$, and $F_1 = \frac{2PR}{P+R}$.
         • Soft categorization: Precision/Recall break-even point (BEP).
    • Multilabel: micro and macro averages.
    • Also, average precision on the 11 standard recall points (category ranking).
    • Standard corpora: Reuters, Ohsumed, 20 Newsgroups (20 NG)...

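 The hard-categorization measures and the two multilabel averages can be illustrated with a short sketch. The per-category counts are invented for the example, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated on the slide.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one category


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); 0.0 by convention when nothing was assigned."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); 0.0 by convention when there are no positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1(p: float, r: float) -> float:
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Macro average: compute F1 per category, then average the scores."""
    scores = [f1(precision(tp, fp), recall(tp, fn))
              for tp, fp, fn in per_category.values()]
    return sum(scores) / len(scores)


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Micro average: pool the counts of all categories, then compute F1."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(precision(tp, fp), recall(tp, fn))


# Invented counts: a frequent category and a rare one.
COUNTS = {"National": (8, 2, 2), "Sports": (1, 0, 3)}

print(round(macro_f1(COUNTS), 4))  # 0.6
print(round(micro_f1(COUNTS), 4))  # 0.72
```

 The macro average weights every category equally, so the rare category pulls it down; the micro average is dominated by the frequent category, which matters on very unbalanced datasets.
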
Outline


  1   Text Categorization
  ⇒   Bayesian networks
  3   An OR Gate-Based Text Classifier
  4   Automatic Indexing From a Thesaurus Using Bayesian Networks
  5   Structured Document Categorization Using Bayesian Networks
  6   Final Remarks

Bayesian networks I
Definition and characteristics

 A set of random variables $X_1, \dots, X_n$ arranged in a DAG and satisfying
 $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$ ⇒ the graph represents independences.

 Causal interpretation.

 Learning and inference methods available.

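
As a small illustration of the factorization above, the sketch below builds the joint distribution of a three-node network A → C ← B from its local distributions. The network, the variable names and all the probability values are invented for the example; they do not come from the dissertation.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary variables (values are made up).
p_a = {True: 0.3, False: 0.7}           # P(A)
p_b = {True: 0.6, False: 0.4}           # P(B)
p_c_true = {                            # P(C = True | A, B)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.1,
}

def joint(a: bool, b: bool, c: bool) -> float:
    """P(A=a, B=b, C=c) as the product of each node given its parents."""
    pc = p_c_true[(a, b)]
    return p_a[a] * p_b[b] * (pc if c else 1.0 - pc)

# The factorized joint must sum to 1 over all configurations.
total = sum(joint(a, b, c) for a, b, c in product((True, False), repeat=3))
print(total)
```

Only the local tables are stored; any joint probability is recovered by multiplication.
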
Bayesian networks III
Estimation and storage problems

 The problem

   • One value for each configuration of the parents.
   • General case: the number of parameters is exponential in the number of parents.

 The solution

   1   Few parents per node (not realistic in text).
   2   Write the probability of a node as a deterministic function of the configuration (canonical model). The set of parameters is then linear in the number of parents.

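
To make the storage argument concrete, the following sketch compares the two parameter counts. It assumes binary nodes with binary parents, and the function names are illustrative only.

```python
def full_table_parameters(num_parents: int) -> int:
    """One probability per configuration of the (binary) parents."""
    return 2 ** num_parents

def canonical_parameters(num_parents: int) -> int:
    """One weight per parent, as in a noisy-OR style canonical model."""
    return num_parents

for n in (5, 20, 50):
    print(n, full_table_parameters(n), canonical_parameters(n))
```

With thousands of terms as parents of a category node, only the linear option can be stored.
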
Bayesian networks: canonical models
Components and examples of canonical models

   • $X = \{X_i\}$: parents (causes), $Y$: child (effect).
   • $X_i \in \{x_i, \bar{x}_i\}$, $Y \in \{y, \bar{y}\}$ (occurrence or not).

   1   Noisy-OR gate model: $p(y \mid X) = 1 - \prod_{X_i \in R(x)} (1 - w_{OR}(X_i, Y))$.
   2   Additive model: $p(y \mid X) = \sum_{X_i \in R(x)} w_{add}(X_i, Y)$.

 [Figure: parent nodes $X_1, X_2, \dots, X_L$ drawn above their common child node $Y$.]

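
The two canonical models can be sketched as follows. This assumes that $R(x)$ denotes the parents that occur in the configuration $x$, which the slide does not state explicitly; the weights are invented for the example.

```python
from math import prod

def noisy_or(weights: dict, present: set) -> float:
    """p(y | x) = 1 - product of (1 - w) over the parents that occur."""
    return 1.0 - prod(1.0 - weights[x] for x in present)

def additive(weights: dict, present: set) -> float:
    """p(y | x) = sum of w over the parents that occur.

    The weights are assumed to sum to at most 1 so the result is a probability.
    """
    return sum(weights[x] for x in present)

w_or = {"X1": 0.5, "X2": 0.2, "X3": 0.9}    # hypothetical noisy-OR weights
w_add = {"X1": 0.3, "X2": 0.1, "X3": 0.4}   # hypothetical additive weights
present = {"X1", "X2"}                      # X3 does not occur

print(noisy_or(w_or, present))
print(additive(w_add, present))
```

Both models need one weight per parent, and a configuration with no occurring parent yields probability 0.
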
Outline

  1   Text Categorization
  2   Bayesian networks
  ⇒ An OR Gate-Based Text Classifier
  4   Automatic Indexing From a Thesaurus Using Bayesian Networks
  5   Structured Document Categorization Using Bayesian Networks
  6   Final Remarks

An OR Gate-Based Text Classifier
Why use OR gates?

   • The OR gate is a simple model (and fast for inference).
   • It has been very successful in knowledge representation.
   • It is a discriminative classifier (it models $p(c_i \mid d_j)$ directly), whereas NB is generative.
   • It seems natural that the terms are the causes and the category is the effect.

 [Figure: two network structures, each with a category node $C$ and term nodes $T_1, T_2, \dots, T_n$.]

An OR Gate-Based Text Classifier
The model

 Equations for the OR gate classifier

 Probability distributions:

   $p_i(c_i \mid pa(C_i)) = 1 - \prod_{T_k \in R(pa(C_i))} (1 - w(T_k, C_i))$,
   $p_i(\bar{c}_i \mid pa(C_i)) = 1 - p_i(c_i \mid pa(C_i))$.

 Inference:

   $p_i(c_i \mid d_j) = 1 - \prod_{T_k \in Pa(C_i)} (1 - w(T_k, C_i)\, p(t_k \mid d_j)) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} (1 - w(T_k, C_i))$.

 The model is characterized by the formula chosen for $w(T_k, C_i)$.

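
The inference step can be sketched as below. It uses the second form of the equation, which follows from the first when $p(t_k \mid d_j)$ is 1 for terms occurring in the document and 0 otherwise. The categories, terms and weight values are invented for illustration.

```python
from math import prod

def or_gate_posterior(doc_terms: set, weights: dict) -> float:
    """p(c_i | d_j) = 1 - product of (1 - w(T_k, C_i)) over parent terms in d_j."""
    return 1.0 - prod(1.0 - w for term, w in weights.items() if term in doc_terms)

# Hypothetical weights w(T_k, C_i), one dictionary of parent terms per category.
weights = {
    "sport": {"goal": 0.7, "match": 0.5, "bank": 0.01},
    "finance": {"bank": 0.8, "match": 0.05},
}
doc = {"goal", "match", "today"}

# One independent score per category, as in soft categorization.
scores = {c: or_gate_posterior(doc, w) for c, w in weights.items()}
print(scores)
```

Terms outside $Pa(C_i)$, such as "today" here, are simply ignored, so the cost per category is linear in the number of parent terms that occur in the document.
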
An OR Gate-Based Text Classifier
Weight estimation formulae

 Weight $w(T_k, C_i)$ means $\hat{p}_i(c_i \mid t_k, \bar{t}_h \ \forall T_h \in Pa(C_i), T_h \neq T_k)$.

   1   ML: $w(T_k, C_i)$ as $\hat{p}(c_i \mid t_k)$, $w(T_k, C_i) = \frac{N_{ik} + 1}{N_{\bullet k} + 2}$ (using Laplace).
   2   TI: assuming term independence, given the category, $w(T_k, C_i) = \frac{N_{ik}}{nt_i\, N_{\bullet k}} \prod_{h \neq k} \frac{(N_{i\bullet} - N_{ih})\, N}{(N - N_{\bullet h})\, N_{i\bullet}}$.

 Notation: $N$ is the number of words in the training documents, $N_{\bullet k}$ is the number of times that term $t_k$ appears in the training documents ($N_{\bullet k} = \sum_{c_i} N_{ik}$), $N_{i\bullet}$ is the total number of words in documents of class $c_i$ ($N_{i\bullet} = \sum_{t_k} N_{ik}$), and $M$ is the vocabulary size.

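
A minimal sketch of the ML estimate with Laplace smoothing follows. It assumes that $N_{ik}$ is the number of times term $t_k$ appears in the training documents of class $c_i$, which is what the sums in the notation imply. The function name and the toy corpus are hypothetical, and the TI estimate is not covered.

```python
from collections import Counter

def ml_weights(docs_by_class: dict) -> dict:
    """Return w[c][t] = (N_ik + 1) / (N_.k + 2) for every class c and term t."""
    # N_ik: occurrences of each term in the documents of each class.
    n_ik = {
        c: Counter(token for doc in docs for token in doc)
        for c, docs in docs_by_class.items()
    }
    # N_.k: occurrences of each term in all training documents.
    n_k = Counter()
    for counts in n_ik.values():
        n_k.update(counts)
    return {
        c: {t: (counts[t] + 1) / (n_k[t] + 2) for t in n_k}
        for c, counts in n_ik.items()
    }

# Toy training set: class -> list of tokenized documents.
training = {
    "sport": [["goal", "match"], ["goal"]],
    "finance": [["bank", "match"]],
}
w = ml_weights(training)
print(w["sport"]["goal"], w["finance"]["goal"])
```

The smoothing keeps every weight strictly between 0 and 1, so a single term can never force the OR gate output to exactly 0 or 1.
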
An OR Gate-Based Text Classifier
Pruning independent terms

   • The model can be improved by pruning terms which are independent of the category.
   • We run an independence test for each term/category pair, at a certain confidence level. Only terms which pass it are kept.
   • The size of the parent set is greatly reduced, yet classification is often improved.

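
The slide does not say which independence test is used. As one possible instance, the sketch below applies a chi-square test to a 2x2 term/category contingency table of document counts; both the choice of test and the use of document counts are assumptions. The critical values are the standard chi-square quantiles for one degree of freedom.

```python
# Chi-square critical values for 1 degree of freedom at each confidence level.
CHI2_CRITICAL_1DOF = {0.9: 2.706, 0.99: 6.635, 0.999: 10.828}

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 table.

    a: documents in the category containing the term
    b: documents in the category without the term
    c: documents outside the category containing the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence of dependence
    return n * (a * d - b * c) ** 2 / denominator

def keep_term(a: int, b: int, c: int, d: int, confidence: float = 0.999) -> bool:
    """Keep the term as a parent only if independence is rejected."""
    return chi_square_2x2(a, b, c, d) > CHI2_CRITICAL_1DOF[confidence]

print(keep_term(30, 10, 10, 50))   # term strongly associated with the category
print(keep_term(10, 10, 40, 40))   # term equally frequent inside and outside
```

Raising the confidence level prunes more terms, which matches the three levels compared in the experiments.
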
An OR Gate-Based Text Classifier
Experiments I: Experimental setting

   • We compare a Multinomial NB, a Rocchio, OR TI, OR ML, and both OR models with pruning at levels {0.9, 0.99, 0.999}.
   • We ran experiments on Ohsumed-23, Reuters and 20 NG.
   • Soft categorization, evaluated with macro/micro BEP, 11-point average precision, and macro/micro F1 @{1, 3, 5}.

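
For readers unfamiliar with the macro/micro distinction, here is a short sketch of both F1 averages computed from per-category counts. It assumes the counts were obtained after turning the soft scores into hard assignments (for F1@k, presumably by assigning the k best-scoring categories to each document); the counts themselves are invented.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when there is nothing to count."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_micro_f1(per_category: dict) -> tuple:
    """per_category maps a category to its (TP, FP, FN) counts."""
    # Macro: average the per-category F1 values (every category weighs the same).
    macro = sum(f1(*counts) for counts in per_category.values()) / len(per_category)
    # Micro: pool the counts first (frequent categories dominate).
    tp = sum(counts[0] for counts in per_category.values())
    fp = sum(counts[1] for counts in per_category.values())
    fn = sum(counts[2] for counts in per_category.values())
    return macro, f1(tp, fp, fn)

counts = {"frequent": (8, 2, 0), "rare": (1, 0, 3)}   # hypothetical counts
macro, micro = macro_micro_f1(counts)
print(macro, micro)
```

In this example the poorly handled rare category lowers the macro average more than the micro average, which is why both are usually reported.
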
An OR Gate-Based Text Classifier
Experiments II: Results

 Results (# of wins on 9 measures):

   • Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 (1).
   • Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1).
   • 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB (4).

An OR Gate-Based Text Classifier

 Conclusions

   • A new text categorization model, based on noisy OR gates.
   • Simple and computationally affordable.
   • Results improve noticeably on Naïve Bayes.

 Future work

   • Use more advanced noisy OR models (leaky).
   • Use other canonical models.
   • Explore other alternatives for term pruning.

Outline

  1   Text Categorization
  2   Bayesian networks
  3   An OR Gate-Based Text Classifier
  ⇒ Automatic Indexing From a Thesaurus Using Bayesian Networks
  5   Structured Document Categorization Using Bayesian Networks
  6   Final Remarks

Automatic Indexing From a Thesaurus Using Bayesian Networks: Thesauri I
Definitions

 A thesaurus

 A list of terms representing concepts, in which terms with the same meaning are grouped together, with hierarchical relationships among them.

 [Figure: fragment of a thesaurus hierarchy. Nodes labelled D: health policy, health service, medical institution, psychiatric institution, medical centre. Nodes labelled ND: health, health protection, medical service, psychiatric hospital, outpatient clinic, clinic, hospital, health care centre, dispensary.]

Automatic Indexing From a Thesaurus Using Bayesian Networks: Thesauri II
Indexing with a thesaurus

    • Problem: associating descriptors (keywords) with documents (scientific, medical, legal, ...).
    • Manually: expensive and time-consuming work (due to thousands of descriptors! [EUROVOC > 6000]). Besides:
         • How many descriptors should we assign?
         • Which descriptor in the hierarchy should we assign?

 We propose an automatic solution

   1   TC problem (categories ≡ descriptors).
   2   Makes use of the meta-information of the thesaurus.
   3   Unsupervised and supervised cases.
   4   Based on BNs.

Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case I
From the thesaurus...
 [Figure: fragment of the thesaurus, drawn as a graph of descriptors (D) and non-descriptors (ND).
    Descriptors: health policy, health service, medical centre, medical institution,
      psychiatric institution.
    Non-descriptors: health, health protection, medical service, health care centre,
      dispensary, outpatient clinic, clinic, hospital, psychiatric hospital.]
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case II
... to the Bayesian network.
 [Figure: the same thesaurus fragment with a layer of term nodes added on top.
    Term nodes: policy, protection, service, health, psychiatric, outpatient, hospital,
      clinic, institution, medical, care, centre, dispensary.
    Descriptor (D) and non-descriptor (ND) nodes: as in the previous figure.]

       • Binary variables (relevant / not relevant) for each
           term, descriptor and non-descriptor.

       • Problem: information of different natures is mixed in the
           descriptor nodes!
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case III
Introducing concept and virtual nodes
 [Figure: the network after adding concept and virtual nodes, drawn in layers from top to bottom.
    Term nodes: protection, policy, health, service, care, dispensary, centre, medical,
      outpatient, clinic, hospital, institution, psychiatric.
    D / ND nodes: ND:health, ND:health protection, D:health policy, ND:medical service,
      D:health service, ND:health care centre, ND:dispensary, D:medical centre,
      D:medical institution, ND:outpatient clinic, ND:clinic, ND:hospital,
      D:psychiatric institution, ND:psychiatric hospital.
    Equivalence nodes: E:health policy, E:health service, E:medical centre,
      E:medical institution, E:psychiatric institution.
    Concept nodes: C:health policy, C:health service, C:medical centre,
      C:medical institution, C:psychiatric institution.
    Hierarchy nodes: H:health policy, H:health service.]

       • New concept nodes, C.

       • New hierarchy nodes, H_C.

       • New equivalence nodes, E_C.
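The layered structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the function name `build_parents`, the `T:` prefix for term nodes, and the toy thesaurus (which non-descriptor belongs to which descriptor, and which concept is narrower than which) are all assumptions made for the example.

```python
from collections import defaultdict

# Toy thesaurus fragment. The equivalence links (descriptor -> non-descriptors)
# and the hierarchy link (broader -> narrower) are ASSUMED for illustration.
DESCRIPTORS = {
    "health service": ["medical service"],
    "medical centre": ["dispensary"],
}
NARROWER = {"health service": ["medical centre"]}


def build_parents(descriptors, narrower):
    """Return a dict mapping each node name to the list of its parents."""
    parents = defaultdict(list)
    for desc, non_descs in descriptors.items():
        labels = [("D", desc)] + [("ND", nd) for nd in non_descs]
        for kind, label in labels:
            node = f"{kind}:{label}"
            # The words of a label are the term parents of its D / ND node.
            parents[node] = [f"T:{term}" for term in label.split()]
            # Every label of the concept feeds its equivalence node E_C.
            parents[f"E:{desc}"].append(node)
        # The concept node C always has E_C as a parent.
        parents[f"C:{desc}"].append(f"E:{desc}")
    for broader, children in narrower.items():
        # H_C gathers the narrower concepts and is a second parent of C.
        parents[f"H:{broader}"] = [f"C:{child}" for child in children]
        parents[f"C:{broader}"].append(f"H:{broader}")
    return dict(parents)


network = build_parents(DESCRIPTORS, NARROWER)
```

Separating E_C (evidence from the concept's own labels) from H_C (evidence from narrower concepts) is what keeps the two kinds of information apart, which is the problem noted for the plain descriptor nodes.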
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case IV
Probability distributions
 We already have the structure, but we still have to specify
 the distribution at each node.

 Many parents ⇒ canonical models.

    • D and ND nodes (SUM):

      $p(d^+ \mid pa(D)) = \sum_{T \in R(pa(D))} w(T, D)$

    • H_C nodes (SUM):

      $p(h_c^+ \mid pa(H_C)) = \sum_{C \in R(pa(H_C))} w(C, H_C)$

    • E_C nodes (OR):

      $p(e_c^+ \mid pa(E_C)) = 1 - \prod_{D \in R(pa(E_C))} (1 - w(D, C))$

    • C nodes (OR):

      $$p(c^+ \mid \{e_c, h_c\}) =
      \begin{cases}
      1 - (1 - w(E_C, C))(1 - w(H_C, C)) & \text{if } e_c = e_c^+,\ h_c = h_c^+ \\
      w(E_C, C) & \text{if } e_c = e_c^+,\ h_c = h_c^- \\
      w(H_C, C) & \text{if } e_c = e_c^-,\ h_c = h_c^+ \\
      0 & \text{if } e_c = e_c^-,\ h_c = h_c^-
      \end{cases}$$
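As a rough illustration of the two canonical models used on this slide, the sketch below evaluates a SUM gate and an OR gate for a given set of relevant parents. It assumes that R(pa(X)) denotes the parents that take their "relevant" value in the configuration; the function names and all weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import prod


def sum_gate(weights, relevant):
    """Additive model: add up the weights of the relevant parents.

    Assumes the weights are non-negative and sum to at most 1,
    so that the result is a valid probability.
    """
    return sum(weights[p] for p in relevant)


def or_gate(weights, relevant):
    """Noisy-OR model: 1 minus the product of (1 - w) over the relevant parents."""
    return 1.0 - prod(1.0 - weights[p] for p in relevant)


# Hypothetical weights for a descriptor node with two term parents.
w_terms = {"medical": 0.5, "centre": 0.5}
p_descriptor = sum_gate(w_terms, relevant={"medical"})

# Hypothetical weights for a concept node C with parents E_C and H_C.
w_concept = {"E_C": 0.8, "H_C": 0.5}
concept_table = {
    config: or_gate(w_concept, config)
    for config in [(), ("E_C",), ("H_C",), ("E_C", "H_C")]
}
```

With two parents, `or_gate` reproduces the four cases of the distribution for C nodes: 0 when neither parent is relevant, the single weight when only one is, and 1 − (1 − w(E_C, C))(1 − w(H_C, C)) when both are.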
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Unsupervised case V
Inference
 Classification is seen as inference.

   • Given a document d, we set the term variables
      T ∈ d to $t^+$, and to $t^-$ otherwise.

   • Exact propagation is carried out:

        • First, to descriptor nodes.

        • Then, to E_C nodes.

        • Following a topological order, probabilities
            $p(h_c^+ \mid pa(H_C))$ are computed after their parents'
            $p(c^+ \mid pa(C))$ are set.

   • The final $p(c^+ \mid pa(C))$ values, ∀C, are returned.
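The propagation order above can be sketched as a single pass over the nodes in topological order. This is a minimal sketch under stated assumptions, not the thesis code: the toy network, its weights and the function name `classify` are hypothetical, and the layer-by-layer formulas coincide with exact propagation only if the parents of each node are independent given the document, which is assumed here.

```python
from math import prod

# Hypothetical toy network, listed in topological order:
# node -> (gate type, {parent: weight}). All weights are made up.
NETWORK = {
    "D:medical centre": ("sum", {"T:medical": 0.5, "T:centre": 0.5}),
    "ND:dispensary":    ("sum", {"T:dispensary": 1.0}),
    "D:health service": ("sum", {"T:health": 0.5, "T:service": 0.5}),
    "E:medical centre": ("or",  {"D:medical centre": 0.9, "ND:dispensary": 0.9}),
    "E:health service": ("or",  {"D:health service": 0.9}),
    "C:medical centre": ("or",  {"E:medical centre": 1.0}),
    "H:health service": ("sum", {"C:medical centre": 1.0}),
    "C:health service": ("or",  {"E:health service": 1.0, "H:health service": 0.5}),
}


def classify(document_terms, network=NETWORK):
    """Return the probability of relevance of every concept node."""
    prob = {}
    # Evidence: a term node is relevant iff the term occurs in the document.
    for _, parents in network.values():
        for parent in parents:
            if parent.startswith("T:"):
                prob[parent] = 1.0 if parent[2:] in document_terms else 0.0
    # Dicts keep insertion order, so parents are always processed first.
    for node, (gate, parents) in network.items():
        if gate == "sum":
            prob[node] = sum(w * prob[p] for p, w in parents.items())
        else:  # noisy OR
            prob[node] = 1.0 - prod(1.0 - w * prob[p] for p, w in parents.items())
    return {n: p for n, p in prob.items() if n.startswith("C:")}


scores = classify({"medical", "centre"})
```

In this toy run the document mentions only "medical" and "centre", yet the broader concept still receives a non-zero score through its hierarchy node, which is the intended effect of the H_C nodes.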
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Supervised model I
Changes from the unsupervised case
   • Each concept also receives information from the labeled
     documents in the training set.

   • We add a training node T_C as a new parent of the
     concept node. The node is an OR gate ML classifier.

   • The distributions of the C nodes are modified
     accordingly.
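One plausible reading of "modified accordingly" is that the OR gate of each concept node simply gains T_C as a third parent. The sketch below illustrates that reading only; the exact distribution is not given on this slide, so the three-parent form, the function name and the weights are all assumptions.

```python
from math import prod


def concept_probability(weights, relevant):
    """Noisy OR over whichever parents of C are relevant (assumed form)."""
    return 1.0 - prod(1.0 - weights[p] for p in relevant)


# Hypothetical weights for the links E_C -> C, H_C -> C and T_C -> C.
weights = {"E_C": 0.8, "H_C": 0.5, "T_C": 0.6}

# Only the thesaurus evidence is relevant (as in the unsupervised model).
p_thesaurus_only = concept_probability(weights, {"E_C"})

# The trained classifier node T_C also signals relevance.
p_with_training = concept_probability(weights, {"E_C", "T_C"})
```

Under this assumption the training node can only raise the probability of a concept, never lower it, which is a property of the OR gate rather than something stated in the slides.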
Automatic Indexing From a Thesaurus Using Bayesian
Networks: Supervised model II
Graphically: unsupervised
 [Figure: the unsupervised network, drawn in layers from top to bottom.
    Term nodes: protection, policy, health, service, care, dispensary, centre, medical,
      outpatient, clinic, hospital, institution, psychiatric.
    D / ND nodes: ND:health, ND:health protection, D:health policy, ND:medical service,
      D:health service, ND:health care centre, ND:dispensary, D:medical centre,
      D:medical institution, ND:outpatient clinic, ND:clinic, ND:hospital,
      D:psychiatric institution, ND:psychiatric hospital.
    Equivalence nodes: E:health policy, E:health service, E:medical centre,
      E:medical institution, E:psychiatric institution.
    Concept nodes: C:health policy, C:health service, C:medical centre,
      C:medical institution, C:psychiatric institution.
    Hierarchy nodes: H:health policy, H:health service.]
                                                                                                                                                                                                      Link-based categorization
                                                                                                                                                                                                      Multiclass model
                                                                                                                                                                                                      Experiments
                                                                                                                                                                                                      Multilabel model
                                                                                                                                                                                                      Experiments

                                                                                                                                                                                                      Remarks
                                                                                                                                                                                                            PhD Dissertation.35
PhD Dissertation on Document Classification Models

PhD Dissertation on Document Classification Models

  • 1. PhD Dissertation: Document Classification Models based on Bayesian networks
      Alfonso E. Romero. April 27, 2010.
      Advisors: Luis M. de Campos and Juan M. Fernández-Luna.
      Department of Computer Science and A.I., University of Granada.
  • 2. A brief overview: the problem I
      We shall solve problems in document categorization...
      [Figure: the news item "New HSL Madrid-Valencia to open in December, 2010", first unlabeled (?), then assigned to National among the categories Sports, Society, World and National.]
  • 3. A brief overview: the problem II
      ... in particular, automatic document categorization...
      [Figure: the news item "New HSL Madrid-Valencia to open in December, 2010", first unlabeled (?), then assigned to National among the categories Sports, Society, World and National.]
  • 4. A brief overview: the problem III
      ... concretely, supervised document categorization.
      [Figure: a labeled corpus goes through a learning procedure that produces an algorithm (classifier). The corpus shows three news items, each labeled National: "The Alhambra of Granada, most visited monument in 2009"; "Prime minister Zapatero to give a speech on the Parliament"; "AVE train to run at 350 km/h in 2010 from Madrid to Barcelona".]
  • 5. A brief overview: the methods I
      Which learning method shall we use?
      • Neural networks
      • Support Vector Machines
      • k-NN methods
      • Bayesian networks and probabilistic methods
      • Decision trees
      • Evolutionary algorithms
  • 6. A brief overview: the methods II
      The answer:
      • Neural networks
      • Support Vector Machines
      • k-NN methods
      • Bayesian networks and probabilistic methods
      • Decision trees
      • Evolutionary algorithms
  • 7. A brief overview: the methods III
      But, why?
      • Strong theoretical foundation (probability theory).
      • Models for (uncertain) knowledge representation.
      • Great success in related tasks (IR).
      • Our background at the group UTAI.
  • 8. Outline
      1 Text Categorization
      2 Bayesian networks
      3 An OR Gate-Based Text Classifier
      4 Automatic Indexing From a Thesaurus Using Bayesian Networks
      5 Structured Document Categorization Using Bayesian Networks
      6 Final Remarks
  • 9. Outline
      ⇒ 1 Text Categorization
      2 Bayesian networks
      3 An OR Gate-Based Text Classifier
      4 Automatic Indexing From a Thesaurus Using Bayesian Networks
      5 Structured Document Categorization Using Bayesian Networks
      6 Final Remarks
  • 10. Supervised Text Categorization I
      Provided:
      1 A set of labeled documents $D_{Tr}$ (training).
      2 $C$, the set of categories/labels.
      The goal is to build a model $f$ (classifier) capable of predicting categories (of $C$) of documents in $D$.
      Different kinds of labeling:
      • $f : D \to \{c, \bar{c}\}$ (binary).
      • $f : D \to \{c_1, c_2, \ldots, c_n\}$ (multiclass).
      • $f : D \times C \to \{0, 1\}$ (multilabel).
      A multilabel problem reduces to $|C|$ binary problems, each with $C = \{c, \bar{c}\}$. We often change the codomain from $\{0, 1\}$ (hard classification) to $[0, 1]$ (soft classification).
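The reduction of a multilabel problem to $|C|$ binary problems can be sketched as follows. The function name `to_binary_problems` and the toy labels are illustrative assumptions, not taken from the slides.

```python
def to_binary_problems(labels, categories):
    """Split a multilabel dataset into one binary target vector per category.

    labels[j] is the set of categories assigned to document j.
    The result maps each category c to a 0/1 list (c vs. not-c) over the documents.
    """
    return {
        c: [1 if c in doc_labels else 0 for doc_labels in labels]
        for c in categories
    }


# Hypothetical toy data: three documents, three categories.
categories = ["National", "Sports", "World"]
labels = [{"National"}, {"Sports", "World"}, set()]
binary = to_binary_problems(labels, categories)
```

Each of the resulting target vectors can then be handed to any binary classifier, hard or soft.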
  • 11. Supervised Text Categorization II
      Document representation: as in Information Retrieval.
      Stopword removal + stemming + vector representation (frequency, binary or tf-idf).
      From (preprocessed) document to vector: term ⇔ dimension.
      Example (beginning of John Milton's "Paradise Lost"):
        Of Mans First Disobedience, and the Fruit
        Of that Forbidden Tree, whose mortal tast
        Brought Death into the World, and all our woe,
        With loss of Eden, till one greater Man...
      Resulting frequency vector:
        man 2, obedience 1, fruit 1, forbid 1, tree 1, mortal 1, tast 1, bring 1, death 1, world 1, woe 1, loss 1, eden 1, great 1
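A minimal sketch of the three vector representations follows. It rests on loud assumptions: the stopword list is a toy one, no stemming is applied (so it will not reproduce the stemmed vector of the slide), and the tf-idf weighting `tf * log(N / df)` is just one common variant, not necessarily the one used in the dissertation.

```python
import math
import re
from collections import Counter

# Toy stopword list (assumption); a real system would use a full list plus a stemmer.
STOPWORDS = {"of", "and", "the", "that", "into", "all", "our", "with", "till", "one", "whose"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords. No stemming."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequency_vector(tokens):
    """term -> number of occurrences in the document."""
    return Counter(tokens)


def binary_vector(tokens):
    """term -> 1 for every term present in the document."""
    return {t: 1 for t in set(tokens)}


def tfidf_vectors(docs_tokens):
    """One tf-idf dictionary per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in docs_tokens
    ]


tokens = tokenize("Of Mans First Disobedience, and the Fruit")
vectors = tfidf_vectors([["fruit", "tree"], ["fruit"]])
```

Each distinct term is one dimension, so the dictionaries above are sparse views of vectors whose length is the vocabulary size.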
  • 12. Supervised Text Categorization III
      What are the particularities of the problem? It differs from a "classic" Machine Learning problem in:
      • High dimensionality (easily > 10000).
      • Very unbalanced datasets.
      • $|C| \gg 0$.
      • Sometimes, there is a hierarchy in the set $C$.
      • Sometimes, explicit relationships among documents are given.
  • 13. Supervised Text Categorization IV: Evaluation
      How to measure the correctness of the documents assigned to a set of categories?
      • Binary/multiclass:
          • Hard categorization: Precision $P = \frac{TP}{TP+FP}$, Recall $R = \frac{TP}{TP+FN}$, and $F_1 = \frac{2PR}{P+R}$.
          • Soft categorization: Precision/Recall BEP.
      • Multilabel: micro and macro averages.
      • Also, average precision on the 11 std. recall points (category ranking).
      • Standard corpora: Reuters, Ohsumed, 20 NG...
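The hard-categorization measures and their micro and macro averages can be sketched as follows. The function names and the per-category counts are illustrative assumptions; returning 0 when a denominator is empty is also a convention chosen here, not one stated in the slides.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_micro_f1(counts):
    """counts: one (tp, fp, fn) triple per category.

    Macro: average the per-category F1 values (every category weighs the same).
    Micro: pool the counts over categories, then compute a single F1.
    """
    macro = sum(precision_recall_f1(*c)[2] for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = precision_recall_f1(tp, fp, fn)[2]
    return macro, micro


# Hypothetical counts for a frequent and a rare category.
macro, micro = macro_micro_f1([(8, 2, 2), (1, 1, 3)])
```

With unbalanced categories the two averages differ: the macro average is pulled down by the rare category, while the micro average is dominated by the frequent one.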
  • 14. Outline
      1 Text Categorization
      ⇒ 2 Bayesian networks
      3 An OR Gate-Based Text Classifier
      4 Automatic Indexing From a Thesaurus Using Bayesian Networks
      5 Structured Document Categorization Using Bayesian Networks
      6 Final Remarks
  • 15. Bayesian networks I: Definition and characteristics
      A set of random variables $X_1, \ldots, X_n$ in a DAG, verifying
      $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
      ⇒ the graph represents independences.
      Causal interpretation.
      Learning and inference methods available.
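As an illustration of the factorization, the sketch below evaluates the joint probability of a tiny network with binary variables. The structure (A and B as parents of Y) and all the numbers are made-up assumptions for the example.

```python
from itertools import product

# parents[X] lists the parents of X (a DAG).
# cpt[X] maps a tuple of parent values to P(X = 1 | parents).
parents = {"A": [], "B": [], "Y": ["A", "B"]}
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "Y": {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 0.9},
}


def joint(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | Pa(X_i)) over all nodes."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


# Summing the joint over every configuration must give 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
```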
  • 16. Bayesian networks III: Estimation and storage problems
      The problem:
      • One value for each configuration of the parents.
      • General case: the number of parameters is exponential in the number of parents.
      The solution:
      1 Few parents per node (not realistic in text).
      2 Write the probability of a node as a deterministic function of the configuration (canonical model). The set of parameters then has a size linear in the number of parents.
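The size difference can be made concrete with a short count, assuming binary variables and a canonical model with one weight per parent (both assumptions of this sketch; the function names are hypothetical).

```python
def full_table_parameters(n_parents):
    """One value of P(node = 1 | configuration) per configuration of binary parents."""
    return 2 ** n_parents


def canonical_model_parameters(n_parents):
    """One weight per parent, as in the canonical models considered here."""
    return n_parents


# A node with 30 parent terms is small for text, yet the full table is already huge.
sizes = (full_table_parameters(30), canonical_model_parameters(30))
```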
  • 17. Bayesian networks: canonical models. Components and examples of canonical models
      • $X = \{X_i\}$: parents (causes), $Y$: child (effect).
      • $X_i \in \{x_i, \bar{x}_i\}$, $Y \in \{y, \bar{y}\}$ (occurrence or not).
      1 Noisy-OR gate model:
        $$p(y \mid X) = 1 - \prod_{X_i \in R(x)} \bigl(1 - w_{OR}(X_i, Y)\bigr)$$
      2 Additive model:
        $$p(y \mid X) = \sum_{X_i \in R(x)} w_{add}(X_i, Y)$$
      Here $R(x)$ is the set of parents that occur in the configuration $x$.
      [Figure: parent nodes X1, X2, ..., XL with arcs to the child node Y.]
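Both canonical models can be sketched in a few lines. The weights below are made-up values; for the additive model they are chosen to sum to 1 so that the result stays a probability, which is an assumption of this example.

```python
from math import prod


def noisy_or(weights, present):
    """p(y | x) = 1 - product of (1 - w) over the parents that occur."""
    return 1.0 - prod(1.0 - weights[x] for x in present)


def additive(weights, present):
    """p(y | x) = sum of w over the parents that occur."""
    return sum(weights[x] for x in present)


# Hypothetical weights for three parents of Y.
w_or = {"X1": 0.5, "X2": 0.2, "X3": 0.9}
w_add = {"X1": 0.5, "X2": 0.2, "X3": 0.3}

p_or = noisy_or(w_or, ["X1", "X2"])      # 1 - 0.5 * 0.8
p_add = additive(w_add, ["X1", "X3"])    # 0.5 + 0.3
```

In both cases only one weight per parent is stored, and parents that do not occur contribute nothing.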
  • 18. Outline
      1 Text Categorization
      2 Bayesian networks
      ⇒ 3 An OR Gate-Based Text Classifier
      4 Automatic Indexing From a Thesaurus Using Bayesian Networks
      5 Structured Document Categorization Using Bayesian Networks
      6 Final Remarks
  • 19. An OR Gate-Based Text Classifier: Why use OR gates?
      • The OR gate is a simple model (and fast for inference).
      • It has had great success in knowledge representation.
      • It is a discriminative classifier (it models $p(c_i \mid d_j)$ directly), whereas NB is generative.
      • It seems natural that the terms are the causes and the category is the effect.
      [Figure: two networks over the category node C and the term nodes T1, T2, ..., Tn.]
  • 20. An OR Gate-Based Text Classifier: The model. Equations for the OR gate classifier
      Probability distributions:
      $$p_i(c_i \mid pa(C_i)) = 1 - \prod_{T_k \in R(pa(C_i))} \bigl(1 - w(T_k, C_i)\bigr)$$
      $$p_i(\bar{c}_i \mid pa(C_i)) = 1 - p_i(c_i \mid pa(C_i))$$
      Inference:
      $$p_i(c_i \mid d_j) = 1 - \prod_{T_k \in Pa(C_i)} \bigl(1 - w(T_k, C_i)\, p(t_k \mid d_j)\bigr) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr)$$
      The model is characterized by the formula chosen for $w(T_k, C_i)$.
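The inference formula can be sketched directly: only the parent terms that appear in the document contribute to the product. The function names, categories and weights below are hypothetical; a trained model would supply $w(T_k, C_i)$.

```python
from math import prod


def or_gate_posterior(doc_terms, weights):
    """p(c_i | d_j) = 1 - product of (1 - w(T_k, C_i)) over parents T_k present in d_j.

    weights maps each parent term of C_i to its weight w(T_k, C_i).
    """
    return 1.0 - prod(1.0 - w for term, w in weights.items() if term in doc_terms)


def rank_categories(doc_terms, weights_by_category):
    """Soft classification: score every category and sort by decreasing posterior."""
    scores = {c: or_gate_posterior(doc_terms, w) for c, w in weights_by_category.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up weights for two categories.
weights_by_category = {
    "National": {"train": 0.4, "madrid": 0.5},
    "Sports": {"goal": 0.7, "madrid": 0.1},
}
ranking = rank_categories({"train", "madrid", "open"}, weights_by_category)
```

Because absent terms drop out of the product, the cost of scoring a category is linear in the number of its parent terms found in the document.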
  • 21. An OR Gate-Based Text Classifier: Weight estimation formulae
      The weight $w(T_k, C_i)$ means $\hat{p}_i(c_i \mid t_k, \bar{t}_h \;\forall T_h \in Pa(C_i), T_h \neq T_k)$.
      1 ML: estimate $w(T_k, C_i)$ as $\hat{p}(c_i \mid t_k)$ (using Laplace):
        $$w(T_k, C_i) = \frac{N_{ik} + 1}{N_{\bullet k} + 2}$$
      2 TI: assuming term independence, given the category:
        $$w(T_k, C_i) = \frac{N_{ik}}{nt_i\, N_{\bullet k}} \prod_{h \neq k} \frac{(N_{i\bullet} - N_{ih})\, N}{(N - N_{\bullet h})\, N_{i\bullet}}$$
      Notation: $N$ is the number of words in the training documents; $N_{ik}$ is the number of times that term $t_k$ appears in documents of class $c_i$; $N_{\bullet k}$ is the number of times that term $t_k$ appears in the training documents ($N_{\bullet k} = \sum_{c_i} N_{ik}$); $N_{i\bullet}$ is the total number of words in documents of class $c_i$ ($N_{i\bullet} = \sum_{t_k} N_{ik}$); $M$ is the vocabulary size.
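A minimal sketch of the ML estimate with Laplace smoothing follows. It assumes the counts $N_{ik}$ are given as nested dictionaries and that the parents of a category are only the terms seen in its training documents; both the data layout and the function name `ml_weights` are assumptions of this example.

```python
def ml_weights(counts):
    """w(T_k, C_i) = (N_ik + 1) / (N_.k + 2), with N_.k summed over categories.

    counts[c][t] is N_ik: occurrences of term t in training documents of class c.
    """
    term_totals = {}
    for term_counts in counts.values():
        for term, n in term_counts.items():
            term_totals[term] = term_totals.get(term, 0) + n
    return {
        c: {term: (n + 1) / (term_totals[term] + 2) for term, n in term_counts.items()}
        for c, term_counts in counts.items()
    }


# Hypothetical counts for two categories.
counts = {
    "National": {"train": 3, "goal": 1},
    "Sports": {"goal": 5},
}
weights = ml_weights(counts)
```

The smoothing keeps every weight strictly between 0 and 1, so no single term can force the OR gate output to exactly 1.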
  • 22. An OR Gate-Based Text Classifier: Pruning independent terms
      • The model can be improved by pruning terms which are independent of the category.
      • We run an independence test for each term/category pair, at a certain confidence level. Only the terms found to be dependent on the category are kept.
      • The size of the parent set is greatly reduced, and classification is often improved.
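The slides do not say which independence test is used. As one possible instance, the sketch below applies a chi-square test with one degree of freedom to a 2x2 term/category table of document counts; the choice of test, the table layout and the function names are all assumptions of this example.

```python
# Chi-square critical values for 1 degree of freedom at the given confidence levels.
CRITICAL_DF1 = {0.9: 2.706, 0.99: 6.635, 0.999: 10.828}


def chi_square_2x2(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table.

    n11: documents with the term and in the category
    n10: documents with the term, not in the category
    n01: documents without the term, in the category
    n00: documents without the term, not in the category
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denominator == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def keep_term(n11, n10, n01, n00, level=0.999):
    """Keep the term only if independence is rejected at the given confidence level."""
    return chi_square_2x2(n11, n10, n01, n00) > CRITICAL_DF1[level]


kept = keep_term(30, 10, 10, 50)      # strongly associated term
pruned = keep_term(10, 10, 10, 10)    # term distributed independently of the category
```

A higher confidence level is stricter, so it prunes more terms and leaves a smaller parent set.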
  • 23. An OR Gate-Based Text Classifier: Experiments I: Experimental setting
      • We compare a Multinomial NB, a Rocchio, OR TI, OR ML, and both OR models with pruning at levels {0.9, 0.99, 0.999}.
      • We ran experiments on Ohsumed-23, Reuters and 20 NG.
      • Soft categorization, evaluated with macro/micro BEP, average precision at the 11 standard recall points, and macro/micro F1@{1, 3, 5}.
  • 24. An OR Gate-Based Text Classifier: Experiments II: Experimental setting
    Results (# of wins on 9 measures):
    • Reuters: OR-TI-0.999 (7), OR-TI (1), OR-ML-0.999 (1).
    • Ohsumed: OR-TI-0.999 (8), OR-ML-0.99 (1).
    • 20 NG: OR-TI (2), OR-ML (2), OR-TI-0.999 (1), NB (4).
  • 25. An OR Gate-Based Text Classifier: Conclusions
    • A new text categorization model, based on noisy OR gates.
    • Simple and computationally affordable.
    • Results improve noticeably on Naïve Bayes.
    Future work
    • Use more advanced noisy OR models (leaky).
    • Use other canonical models.
    • Explore other alternatives for term pruning.
  • 26. Outline
    1 Text Categorization
    2 Bayesian networks
    3 An OR Gate-Based Text Classifier
    ⇒ 4 Automatic Indexing From a Thesaurus Using Bayesian Networks
    5 Structured Document Categorization Using Bayesian Networks
    6 Final Remarks
  • 27. Automatic Indexing From a Thesaurus Using Bayesian Networks: Thesauri I: Definitions
    A thesaurus: a list of terms representing concepts, in which terms with the same meaning are grouped together and concepts are linked by hierarchical relationships.
    [Figure: fragment of a thesaurus. Descriptors (D): health policy, health service, medical centre, medical institution, psychiatric institution. Non-descriptors (ND): health, health protection, medical service, health care centre, dispensary, clinic, hospital, outpatient clinic, psychiatric hospital.]
  • 28. Automatic Indexing From a Thesaurus Using Bayesian Networks: Thesauri II
    Indexing with a thesaurus
    • Problem: associating descriptors (keywords) with documents (scientific, medical, legal, ...).
    • Manually: expensive and time-consuming work, due to the thousands of descriptors (EUROVOC has more than 6000). Besides:
        • How many descriptors should we assign?
        • Which descriptor in the hierarchy should we assign?
    We propose an automatic solution
    1 TC problem (categories ≡ descriptors).
    2 Makes use of the meta-information of the thesaurus.
    3 Unsupervised and supervised cases.
    4 Based on BNs.
  • 29. Automatic Indexing From a Thesaurus Using Bayesian Networks: Unsupervised case I
    From the thesaurus...
    [Figure: thesaurus fragment with descriptors (D) health policy, health service, medical centre, medical institution, psychiatric institution, and non-descriptors (ND) health, health protection, medical service, health care centre, dispensary, clinic, hospital, outpatient clinic, psychiatric hospital.]
  • 30. Automatic Indexing From a Thesaurus Using Bayesian Networks: Unsupervised case II
    ... to the Bayesian network.
    [Figure: Bayesian network with one node per term (policy, protection, service, health, psychiatric, outpatient, hospital, clinic, institution, medical, care, centre, dispensary) and one node per descriptor (D) and non-descriptor (ND) of the thesaurus fragment.]
    • Binary variables (relevant/not relevant) for each term, descriptor and non-descriptor.
    • Problem: information of a different nature is mixed in descriptor nodes!
  • 31. Automatic Indexing From a Thesaurus Using Bayesian Networks: Unsupervised case III
    Introducing concept and virtual nodes
    [Figure: extended network. Term nodes and D/ND nodes as before, plus equivalence nodes (E: health policy, health service, medical centre, medical institution, psychiatric institution), concept nodes (C: health policy, health service, medical centre, medical institution, psychiatric institution) and hierarchy nodes (H: health service, health policy).]
    • New concept nodes, $C$.
    • New hierarchy nodes, $H_C$.
    • New equivalence nodes, $E_C$.
  • 32. Automatic Indexing From a Thesaurus Using Bayesian Networks: Unsupervised case IV
    Probability distributions
    We already have the structure, but we have to specify the distributions on each node. Many parents ⇒ canonical models.
    • D and ND nodes (SUM): $p(d^+ \mid pa(D)) = \sum_{T \in R(pa(D))} w(T, D)$.
    • $H_C$ nodes (SUM): $p(h_c^+ \mid pa(H_C)) = \sum_{C' \in R(pa(H_C))} w(C', H_C)$.
    • $E_C$ nodes (OR): $p(e_c^+ \mid pa(E_C)) = 1 - \prod_{D \in R(pa(E_C))} (1 - w(D, C))$.
    • $C$ nodes (OR):
      $p(c^+ \mid \{e_c, h_c\}) = \begin{cases} 1 - (1 - w(E_C, C))(1 - w(H_C, C)) & \text{if } e_c = e_c^+,\ h_c = h_c^+ \\ w(E_C, C) & \text{if } e_c = e_c^+,\ h_c = h_c^- \\ w(H_C, C) & \text{if } e_c = e_c^-,\ h_c = h_c^+ \\ 0 & \text{if } e_c = e_c^-,\ h_c = h_c^- \end{cases}$
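    A small sketch of the two canonical models may help. It reads R(pa(X)) as the set of parents that are relevant (instantiated to +) in the given configuration, which is an assumption since the slide does not define R. The function names, node labels and weights are hypothetical.

```python
import math

def sum_gate(weights, relevant):
    """SUM model: add the weights of the relevant parents.
    The weights of all parents are expected to sum to at most 1."""
    return sum(weights[p] for p in relevant)

def or_gate(weights, relevant):
    """Noisy OR model: 1 - product over relevant parents of (1 - weight)."""
    prod = 1.0
    for p in relevant:
        prod *= 1.0 - weights[p]
    return 1.0 - prod

# A concept node C with its two parents, E_C and H_C (made-up weights).
w_c = {"E": 0.8, "H": 0.5}
print(or_gate(w_c, {"E", "H"}))  # both parents relevant
print(or_gate(w_c, {"E"}))       # only the equivalence node relevant
print(or_gate(w_c, {"H"}))       # only the hierarchy node relevant
print(or_gate(w_c, set()))       # no relevant parent
```

    The four calls reproduce the four rows of the case distinction for C nodes, which is simply the noisy OR gate written out for two parents.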
  • 33. Automatic Indexing From a Thesaurus Using Bayesian Networks: Unsupervised case V
    Inference
    Classification is seen as inference.
    • Given a document $d$, we set the term variables $T \in d$ to $t^+$, and to $t^-$ otherwise.
    • Exact propagation is carried out:
        • First, to descriptor nodes.
        • Then, to $E_C$ nodes.
        • Following a topological order, probabilities $p(h_c^+ \mid pa(H_C))$ are computed after the values $p(c^+ \mid pa(C))$ of their parents are set.
    • The final values $p(c^+ \mid pa(C))$, $\forall C$, are returned.
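    The propagation order can be sketched on a two-concept toy network. This is a simplified illustration under stated assumptions, not the dissertation's code: parent probabilities are combined as if the parents were independent (SUM nodes take the weighted sum of parent probabilities, OR nodes take 1 minus the product of (1 - weight * parent probability)), the parents of a hierarchy node are taken to be narrower concepts, and all names, data structures and weights are hypothetical.

```python
import math

def propagate(doc_terms, term_w, equiv_w, concept_w, hier_w, order):
    """Return p(c+ | d) for every concept, given the set of terms in d.

    term_w[x][t]   weight of term t in descriptor/non-descriptor node x (SUM)
    equiv_w[c][x]  weight of D/ND node x in the equivalence node E_c (OR)
    hier_w[c][c2]  weight of concept c2 in the hierarchy node H_c (SUM)
    concept_w[c]   pair (w(E_c, c), w(H_c, c)) for the concept node (OR)
    order          concepts listed so that parents of H_c come before c
    """
    # 1) Terms -> descriptor and non-descriptor nodes.
    p_d = {x: sum(w for t, w in tw.items() if t in doc_terms)
           for x, tw in term_w.items()}
    # 2) D/ND nodes -> equivalence nodes.
    p_e = {}
    for c, dw in equiv_w.items():
        prod = 1.0
        for x, w in dw.items():
            prod *= 1.0 - w * p_d[x]
        p_e[c] = 1.0 - prod
    # 3) Concepts in topological order; H_c needs its parent concepts first.
    p_c = {}
    for c in order:
        p_h = sum(w * p_c[c2] for c2, w in hier_w.get(c, {}).items())
        w_e, w_h = concept_w[c]
        p_c[c] = 1.0 - (1.0 - w_e * p_e[c]) * (1.0 - w_h * p_h)
    return p_c

term_w = {"D:medical centre": {"medical": 0.5, "centre": 0.5},
          "ND:dispensary": {"dispensary": 1.0},
          "D:health service": {"health": 0.5, "service": 0.5}}
equiv_w = {"medical centre": {"D:medical centre": 0.9, "ND:dispensary": 0.9},
           "health service": {"D:health service": 0.9}}
hier_w = {"health service": {"medical centre": 1.0}}
concept_w = {"medical centre": (1.0, 0.0), "health service": (0.8, 0.5)}

result = propagate({"medical", "centre"}, term_w, equiv_w, concept_w,
                   hier_w, order=["medical centre", "health service"])
print(result)
```

    In this toy run the document only mentions "medical centre", yet "health service" still receives a non-zero score through its hierarchy node.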
  • 34. Automatic Indexing From a Thesaurus Using Bayesian Networks: Supervised model I
    Changes from the unsupervised case
    • Each concept also receives information from the labeled documents in the training set.
    • We add a training node $T_C$ as a new parent of the concept node. This node is an OR gate ML classifier.
    • The distributions of the $C$ nodes are modified accordingly.
  • 35. Automatic Indexing From a Thesaurus Using Bayesian Networks: Supervised model II
    Graphically: unsupervised
    [Figure: the unsupervised network. Term nodes; descriptor (D) and non-descriptor (ND) nodes; equivalence nodes (E: health policy, health service, medical centre, medical institution, psychiatric institution); concept nodes (C: health policy, health service, medical centre, medical institution, psychiatric institution); hierarchy nodes (H: health service, health policy).]