SlideShare a Scribd company logo
1 of 98
Download to read offline
Harnessing Folksonomies for Resource
            Classification
              PhD Thesis


            Arkaitz Zubiaga

                 UNED


            July 12th, 2011

                Advisors:
        Raquel Mart´ınez Unanue
        V´
         ıctor Fresno Fern´ndez
                          a
PhD Thesis          Table of Contents
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)             PhD Thesis        July 12th, 2011   2 / 98
Motivation

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)               PhD Thesis      July 12th, 2011   3 / 98
Motivation

 PhD Thesis          Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)         PhD Thesis   July 12th, 2011   4 / 98
Motivation

 PhD Thesis          Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)         PhD Thesis   July 12th, 2011   5 / 98
Motivation

 PhD Thesis          Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)         PhD Thesis   July 12th, 2011   6 / 98
Motivation

 PhD Thesis          Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   Classifying resources is a common task.
STS &
Datasets
                                    Web pages, books, movies, files,...
Representing
the
Aggregation of
                            Large collections of resources → expensive & effortful
Tags                        to classify manually.
Tag
Distributions
                                    LoC reported an average cost of $94.58 for cataloging
on STS                              each book in 2002.
User Behavior
on STS

Conclusions &
                            Enormous costs and efforts → automatic classification.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis              July 12th, 2011   7 / 98
Motivation

 PhD Thesis          Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                            Representation of resources → self-content.
STS &
Datasets

Representing                Use of self-content of resources presents some issues:
the
Aggregation of                      Not always representative enough.
Tags
                                    Not always accessible (e.g., books).
Tag
Distributions
on STS

User Behavior
                            Social tags provided by users → alternative to solve the
on STS                      problem.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis             July 12th, 2011   8 / 98
Motivation

 PhD Thesis          Tagging
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



                                    T1, T2, T3 = sets of tags.

           Arkaitz Zubiaga (UNED)          PhD Thesis            July 12th, 2011   9 / 98
Motivation

 PhD Thesis          Social Tagging
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications                Aggregation of user annotations → folksonomy.
                            Folksonomy: Folk (People) + Taxis (Classification) +
                            Nomos (Management).

           Arkaitz Zubiaga (UNED)              PhD Thesis           July 12th, 2011   10 / 98
Motivation

 PhD Thesis          Organization of Resources
   Arkaitz
   Zubiaga


Motivation

Selection of a              User annotations → own organization of resources.
Classifier

STS &
Datasets                                     A user’s tags
Representing
the                                     Tag       # Resources
Aggregation of
Tags                                    research        82
Tag                                     twitter         28
Distributions
on STS                                  web2.0          35
User Behavior                           language        42
on STS
                                        english         64
Conclusions &
Outlook                                 ...             ...
Publications




           Arkaitz Zubiaga (UNED)             PhD Thesis           July 12th, 2011   11 / 98
Motivation

 PhD Thesis          Example of Bookmarks
   Arkaitz
   Zubiaga


Motivation                          User    Resource            Tags
Selection of a               1      user1   flickr.com           photo, web2.0, social
Classifier
                             2      user2   flickr.com           photography, images
STS &
Datasets                     3      user1   google.com          searchengine
Representing                 4      user3   twitter.com         microblogging, twitter
the
Aggregation of
Tags

Tag                   Bookmark: (1) user ui ∈ U who annotates
                                (2) resource rj ∈ R being annotated
Distributions
on STS

User Behavior                   (3) tags Tij = {t1 , ..., tn } ∈ T utilized.
on STS

Conclusions &
Outlook

Publications

                                            bij : ui × rj × Tij

           Arkaitz Zubiaga (UNED)                  PhD Thesis                 July 12th, 2011   12 / 98
Motivation

 PhD Thesis          Sum of Annotations
   Arkaitz
   Zubiaga


Motivation                                Top   tags (79,681 users)
Selection of a
Classifier
                                    Tag Rank    Tag            User   Count
STS &                                   1       photos                22,712
Datasets
                                        2       flickr                19,046
Representing
the                                     3       photography           15,968
Aggregation of
Tags                                    4       photo                 15,225
Tag                                     5       sharing               10,648
Distributions
on STS                                  6       images                 9,637
User Behavior                           7       web2.0                 9,528
on STS

Conclusions &
                                        8       community              4,571
Outlook                                 9       social                 3,798
Publications
                                       10       pictures               3,115



           Arkaitz Zubiaga (UNED)               PhD Thesis              July 12th, 2011   13 / 98
Motivation

 PhD Thesis          Tag-based Resource Classification
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)         PhD Thesis    July 12th, 2011   14 / 98
Motivation

 PhD Thesis          Problem Statement
   Arkaitz
   Zubiaga


Motivation
                      How can the annotations provided by users on social tagging
Selection of a
                      systems be exploited to improve the accuracy of a resource
Classifier
                                          classification task?
STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)            PhD Thesis            July 12th, 2011   15 / 98
Motivation

 PhD Thesis          Related Work
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
                     Social tags for information management:
Datasets
                            Search: Bao et al. (2007) & Heymann et al. (2008).
Representing
the

                            Recommender Systems: Shepitsen et al. (2008) & Li
Aggregation of
Tags

Tag                         et al. (2008).
Distributions
on STS

User Behavior
on STS
                            Enhanced Browsing: Smith (2008).
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)              PhD Thesis           July 12th, 2011   16 / 98
Motivation

 PhD Thesis          Related Work
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   Classification: Noll and Meinel (2008) → statistical
STS &
Datasets
                            analysis of matches between tags & taxonomies.
Representing
                                    Tags are useful for broad categorization.
the                                 Not for narrower categorization.
Aggregation of
Tags

Tag
Distributions
                            Lack of further research with:
on STS                              Actual classification experiments.
User Behavior                       Other types of resources.
on STS
                                    Different representations of social tags.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis                  July 12th, 2011   17 / 98
Selection of a Classifier

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                       PhD Thesis   July 12th, 2011   18 / 98
Selection of a Classifier

 PhD Thesis          Characteristics of the task
   Arkaitz
   Zubiaga


Motivation                  We have:
Selection of a
Classifier
                                    Large set of resources: some labeled + many unlabeled.
STS &
                                    Multiclass taxonomy.
Datasets

Representing
the
                            Automated classifiers learn a model from labeled
Aggregation of              resources.
Tags

Tag
                                    This model is used to classify unlabeled resources
Distributions                       afterward.
on STS

User Behavior
on STS                      2 learning settings:
Conclusions &
Outlook
                                    Supervised: only labeled resources considered for learning.
                                    Semi-supervised: unlabeled resources are also taken into
Publications
                                    account.



           Arkaitz Zubiaga (UNED)                         PhD Thesis          July 12th, 2011   19 / 98
Selection of a Classifier

 PhD Thesis          Support Vector Machines (SVM)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &               Hyperplane that separates with largest margin.
Outlook

Publications
                            Use of kernels → redimensions the space.

                            Resource/Hyperplane margin → Classifier’s reliability.
           Arkaitz Zubiaga (UNED)                       PhD Thesis     July 12th, 2011   20 / 98
Selection of a Classifier

 PhD Thesis          Selection of a Classifier
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   SVMs solve binary problems by default.
STS &
Datasets

Representing                To solve multiclass tasks:
the
Aggregation of                      Native multiclass classifier (mSVM).
Tags                                Combining binary classifiers:
Tag                                      one-against-all (oaaSVM).
Distributions
on STS                                   one-against-one (oaoSVM).
User Behavior
on STS

Conclusions &
                            Both supervised (s) and semi-supervised (ss).
Outlook

Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis      July 12th, 2011   21 / 98
Selection of a Classifier

 PhD Thesis          Experiment Settings
   Arkaitz
   Zubiaga


Motivation

Selection of a
                            3 benchmark datasets to analyze suitability of
Classifier                   classifiers:
STS &
Datasets

Representing           Dataset              # web pages              # trainset   # categories
                       BankSearch
the
Aggregation of                                   10,000                   3,000             10
                       WebKB
Tags
                                                   4,518                  1,000              6
Tag
Distributions          Y! Science                    788                    100              6
on STS

User Behavior
on STS
                            We present accuracy to show performance.
Conclusions &
Outlook

Publications                We perform 6 runs, and show the average accuracy.



           Arkaitz Zubiaga (UNED)                       PhD Thesis                July 12th, 2011   22 / 98
Selection of a Classifier

 PhD Thesis          Results
   Arkaitz
   Zubiaga


Motivation                                         BankSearch        WebKB   Y! Science
Selection of a             mSVM (s)                   .925            .810      .825
Classifier
                           mSVM (ss)                  .923            .778      .836
STS &
Datasets                   oaaSVM (s)                 .843            .776      .536
Representing
the
                           oaaSVM (ss)                .842            .773      .565
Aggregation of
Tags
                           oaoSVM (s)                 .826            .775      .483
Tag
                           oaoSVM (ss)                .811            .754      .514
Distributions
on STS

User Behavior               Native multiclass classifier performs best, while
on STS

Conclusions &
                            supervised semi-supervised.
Outlook

Publications
                            We used the supervised approach, as it is
                            computationally less expensive.


           Arkaitz Zubiaga (UNED)                       PhD Thesis           July 12th, 2011   23 / 98
STS & Datasets

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                 PhD Thesis    July 12th, 2011   24 / 98
STS & Datasets

 PhD Thesis          Requirements
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets                    Selected STS should have:
Representing                        Large communities involved.
the
Aggregation of                      Public access to data.
Tags
                                    Consolidated taxonomies as a ground truth.
Tag
Distributions

                            We chose Delicious, LibraryThing & GoodReads.
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis             July 12th, 2011   25 / 98
STS & Datasets

 PhD Thesis          Characteristics of STS
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                                        Delicious            LibraryThing         GoodReads
Datasets
                       Resources             web documents        books                books
Representing
the                    Tag suggestions       based on earlier     no                   based on earlier
Aggregation of                               bookmarks on the                          tags utilized by
Tags
                                             resource                                  the user
Tag
Distributions          Tag insertion         space-separated      comma-separated      one by one text-
on STS                                                                                 box
User Behavior          Saving a resource     prompts user to      prompts user to      user needs to
on STS
                                             add tags             add tags at sec-     click again to add
Conclusions &                                                     ond step             tags
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis                      July 12th, 2011   26 / 98
STS & Datasets

 PhD Thesis          Characteristics of STS
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                                        Delicious            LibraryThing         GoodReads
Datasets
                       Resources             web documents        books                books
Representing
the                    Tag suggestions       based on earlier     no                   based on earlier
Aggregation of                               bookmarks on the                          tags utilized by
Tags
                                             resource                                  the user
Tag
Distributions          Tag insertion         space-separated      comma-separated      one by one text-
on STS                                                                                 box
User Behavior          Saving a resource     prompts user to      prompts user to      user needs to
on STS
                                             add tags             add tags at sec-     click again to add
Conclusions &                                                     ond step             tags
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis                      July 12th, 2011   27 / 98
STS & Datasets

 PhD Thesis          Characteristics of STS
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                                        Delicious            LibraryThing         GoodReads
Datasets
                       Resources             web documents        books                books
Representing
the                    Tag suggestions       based on earlier     no                   based on earlier
Aggregation of                               bookmarks on the                          tags utilized by
Tags
                                             resource                                  the user
Tag
Distributions          Tag insertion         space-separated      comma-separated      one by one text-
on STS                                                                                 box
User Behavior          Saving a resource     prompts user to      prompts user to      user needs to
on STS
                                             add tags             add tags at sec-     click again to add
Conclusions &                                                     ond step             tags
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis                      July 12th, 2011   28 / 98
STS & Datasets

 PhD Thesis          Characteristics of STS
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                                        Delicious            LibraryThing         GoodReads
Datasets
                       Resources             web documents        books                books
Representing
the                    Tag suggestions       based on earlier     no                   based on earlier
Aggregation of                               bookmarks on the                          tags utilized by
Tags
                                             resource                                  the user
Tag
Distributions          Tag insertion         space-separated      comma-separated      one by one text-
on STS                                                                                 box
User Behavior          Saving a resource     prompts user to      prompts user to      user needs to
on STS
                                             add tags             add tags at sec-     click again to add
Conclusions &                                                     ond step             tags
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis                      July 12th, 2011   29 / 98
STS & Datasets

 PhD Thesis          Characteristics of STS
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                                        Delicious            LibraryThing         GoodReads
Datasets
                       Resources             web documents        books                books
Representing
the                    Tag suggestions       based on earlier     no                   based on earlier
Aggregation of                               bookmarks on the                          tags utilized by
Tags
                                             resource                                  the user
Tag
Distributions          Tag insertion         space-separated      comma-separated      one by one text-
on STS                                                                                 box
User Behavior          Saving a resource     prompts user to      prompts user to      user needs to
on STS
                                             add tags             add tags at sec-     click again to add
Conclusions &                                                     ond step             tags
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis                      July 12th, 2011   30 / 98
STS & Datasets

 PhD Thesis          Retrieval of Categorized Resources
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier            Retrieval of popular annotated resources, which were also
STS &                categorized by experts.
Datasets

Representing
the
Aggregation of                           Top level (L1)       Second level (L2)
Tags

Tag
                                       Resources Classes     Resources Classes
Distributions
on STS
                       Web   ODP        12,616       17        12,286      243
                             DDC        27,299       10        27,040       99
User Behavior
                       Books
on STS
                             LCC        24,861       20        23,565      204
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)             PhD Thesis          July 12th, 2011   31 / 98
STS & Datasets

 PhD Thesis          Retrieval of Additional User Annotations
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                            Delicious: 300,571,231 bookmarks
STS &
                            → 273,478,137 annotated (91.00%)
Datasets


                            LibraryThing: 44,612,784 bookmarks
Representing
the
Aggregation of
Tags                        → 22,343,427 annotated (50.08%)
Tag
Distributions
on STS                      GoodReads: 47,302,861 bookmarks
User Behavior               → 9,323,539 annotated (19.71%)
on STS

Conclusions &        Importance of system’s encouragement to tagging
Outlook

Publications
                     resources.




           Arkaitz Zubiaga (UNED)               PhD Thesis       July 12th, 2011   32 / 98
STS & Datasets

 PhD Thesis          Tag Popularity on Resources
   Arkaitz
   Zubiaga


Motivation

Selection of a                          1
Classifier
                                       0.9
STS &                                                                            Delicious
Datasets

Representing
                                       0.8
                                                                                 LibraryThing
the
Aggregation of
                                       0.7                                       GoodReads
                       Average usage




Tags
                                       0.6
Tag
Distributions                          0.5
on STS

User Behavior                          0.4
on STS
                                       0.3
Conclusions &
Outlook
                                       0.2
Publications
                                       0.1

                                        0

                                                         Tag rank on resources
           Arkaitz Zubiaga (UNED)                      PhD Thesis                     July 12th, 2011   33 / 98
STS & Datasets

 PhD Thesis          Tag Novelty in Bookmarks by Rank
   Arkaitz
   Zubiaga

                                       100%
Motivation

Selection of a
Classifier
                                       90%                                                   Delicious
STS &                                  80%                                                   LibraryThing
Datasets
                                       70%
Representing
the
Aggregation of
                                       60%
                        % of novelty




Tags
                                       50%
Tag
Distributions
                                       40%
on STS

User Behavior                          30%
on STS

Conclusions &                          20%
Outlook
                                       10%
Publications
                                        0%
                                               3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99
                                              1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97

                                                                         Bookmark rank

           Arkaitz Zubiaga (UNED)                                 PhD Thesis                        July 12th, 2011   34 / 98
STS & Datasets

 PhD Thesis          Retrieval of Additional Data
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   URLs:
STS &
Datasets
                                    Self-content, by crawling URLs.
Representing
                                    User reviews (Delicious & StumbleUpon).
the
Aggregation of
Tags                        Books:
Tag
Distributions
                                    Self-content (unavailable):
on STS                                  Synopses (Barnes&Noble).
User Behavior                           Editorial reviews (Amazon).
on STS
                                    User reviews (LibraryThing, GoodReads & Amazon).
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis            July 12th, 2011   35 / 98
STS & Datasets

 PhD Thesis          Summary of the Analysis of Datasets
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                            Few users annotate resources when the system does
                            not encourage to do it.
Representing
the
Aggregation of
Tags

Tag                         Resource-based tag suggestions → Repeated use of
Distributions
on STS                      popular tags.
User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)               PhD Thesis         July 12th, 2011   36 / 98
Representing the Aggregation of Tags

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                         PhD Thesis   July 12th, 2011   37 / 98
Representing the Aggregation of Tags

 PhD Thesis          Representing Resources Using Tags
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets                    Different ways to aggregate user annotations on a
Representing                vectorial representation.
the
Aggregation of
Tags

Tag
                            2 major factors to consider:
Distributions                       What tags to use?
on STS
                                    How to weigh those tags?
User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis   July 12th, 2011   38 / 98
Representing the Aggregation of Tags

 PhD Thesis          Representing Resources Using Tags
   Arkaitz
   Zubiaga
                            Use of all tags (FTA), or just top 10 tags for each
Motivation
                            resource.
Selection of a
Classifier                   4 different weightings.
STS &
Datasets                    Example of a resource (100 users):
Representing                t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1)
the
Aggregation of
Tags

Tag                                                                      FTA
                                                                  Top 10
Distributions
on STS

User Behavior                                t1       t2         t3 ...    t9    t10   ...      tn
on STS
                         Ranks                1       0.9        0.8 ... 0.2     0.1   ...       0
Conclusions &
Outlook                  Fractions           0.5      0.3        0.2 ... 0.02   0.01   ...     0.01
Publications             Binary               1        1          1  ...   1      1    ...       1
                         TF                  50       30         20 ...    2      1    ...       1


           Arkaitz Zubiaga (UNED)                         PhD Thesis               July 12th, 2011    39 / 98
Representing the Aggregation of Tags

 PhD Thesis          Representing Resources Using Other Data
   Arkaitz           Sources
   Zubiaga


Motivation

Selection of a
Classifier

STS &                       To represent resources using content and reviews:
Datasets

Representing
                               1    Removal of HTML tags.
the
Aggregation of
Tags                           2    Removal of stopwords.
Tag
Distributions
on STS                         3    Stem of remaining words.
User Behavior
on STS

Conclusions &
                               4    TF-IDF weighting of words.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis   July 12th, 2011   40 / 98
Representing the Aggregation of Tags

 PhD Thesis          Experiment Setup
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   Multiclass SVMs.
STS &
Datasets

Representing
                            Show the average accuracy of 6 runs.
the
Aggregation of
Tags                        For clarity of presentation, we limit results to:
Tag
Distributions
                                    LCC taxonomy for books.
on STS                              Training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for
User Behavior                       test).
                                    Training sets of 18,000 books (8,861 (L1)/5,565 (L2) for
on STS

Conclusions &
Outlook                             test).
Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis        July 12th, 2011   41 / 98
Representing the Aggregation of Tags

 PhD Thesis          Experiment Setup
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                     Compared Representations
STS &                       Self-content (baseline).
Datasets

Representing
the                         Reviews.
Aggregation of
Tags

Tag                         Tags:
Distributions
on STS                              Ranks (Top 10).
User Behavior                       Fractions (Top 10 & FTA).
on STS
                                    Binary (Top 10 & FTA).
Conclusions &
Outlook                             TF (Top 10 & FTA).
Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis   July 12th, 2011   42 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Tags vs Other Data Sources
   Arkaitz
   Zubiaga


Motivation
                                                             Delicious         LThing        GReads
Selection of a
                                                             L1    L2         L1   L2       L1   L2
Classifier
                                                             (17)      (243) (20)    (204) (20)           (204)
STS &
Datasets                Content                              .610      .470   .807   .673   .807          .673
Representing            Reviews                              .646      .524   .828   .705   .828          .705
                          Ranks
the
Aggregation of                                               .484      .360   .795   .511   .630          .405
                          Fractions (10)
Tags
                                                             .464      .349   .738   .411   .663          .427
Tag
                          Fractions (FTA)                    .461      .336   .712   .409   .654          .432
                       Tags




Distributions

                          Binary (10)
on STS
                                                             .531      .361   .770   .550   .623          .422
User Behavior
on STS                    Binary (FTA)                       .572      .529   .655   .606   .639          .481
Conclusions &             TF (10)                            .654      .545   .855   .722   .713          .491
Outlook

Publications
                          TF (FTA)                           .680      .568   .857   .736   .731          .517

                              Usually, FTA > 10.

           Arkaitz Zubiaga (UNED)                         PhD Thesis                    July 12th, 2011      43 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Tags vs Other Data Sources
   Arkaitz
   Zubiaga


Motivation
                                                             Delicious         LThing        GReads
Selection of a
                                                             L1    L2         L1   L2       L1   L2
Classifier                                                    (17)      (243) (20)    (204) (20)           (204)
STS &
Datasets                Content                              .610      .470   .807   .673   .807          .673
Representing            Reviews                              .646      .524   .828   .705   .828          .705
                          Ranks
the
Aggregation of                                               .484      .360   .795   .511   .630          .405
Tags
                          Fractions (10)                     .464      .349   .738   .411   .663          .427
                          Fractions (FTA)
Tag
                                                             .461      .336   .712   .409   .654          .432
                       Tags




Distributions

                          Binary (10)
on STS
                                                             .531      .361   .770   .550   .623          .422
User Behavior
on STS                    Binary (FTA)                       .572      .529   .655   .606   .639          .481
Conclusions &             TF (10)                            .654      .545   .855   .722   .713          .491
Outlook
                          TF (FTA)                           .680      .568   .857   .736   .731          .517
Publications



                              TF (FTA) is the best approach for tags.

           Arkaitz Zubiaga (UNED)                         PhD Thesis                    July 12th, 2011      44 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Tags vs Other Data Sources
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                                           Delicious            LThing      GReads
STS &
Datasets                                            L1    L2            L1   L2     L1   L2
Representing                                        (17)         (243) (20)   (204) (20)    (204)
the
Aggregation of                      Content         .610 .470 .807 .673 .807 .673
Tags
                                    Reviews         .646 .524 .828 .705 .828 .705
Tag
Distributions                       Tags            .680 .568 .857 .736 .731 .517
on STS

User Behavior
on STS
                            Tags clearly outperform content and reviews on
Conclusions &
Outlook                     Delicious and LibraryThing.
Publications




           Arkaitz Zubiaga (UNED)                          PhD Thesis                  July 12th, 2011   45 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Tags vs Other Data Sources
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                                           Delicious            LThing      GReads
STS &
Datasets                                            L1    L2            L1   L2     L1   L2
Representing                                        (17)         (243) (20)   (204) (20)    (204)
the
Aggregation of                      Content         .610 .470 .807 .673 .807 .673
Tags
                                    Reviews         .646 .524 .828 .705 .828 .705
Tag
Distributions                       Tags            .680 .568 .857 .736 .731 .517
on STS

User Behavior
on STS
                            GoodReads’ disencouragement to tagging makes it
Conclusions &
Outlook                     insufficient to outperform content and reviews.
Publications




           Arkaitz Zubiaga (UNED)                          PhD Thesis                  July 12th, 2011   46 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Tags vs Other Data Sources
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
                                                    Delicious            LThing      GReads
Datasets                                            L1    L2            L1   L2     L1   L2
Representing
the
                                                    (17)         (243) (20)   (204) (20)    (204)
Aggregation of
Tags
                                    Content         .610 .470 .807 .673 .807 .673
Tag                                 Reviews         .646 .524 .828 .705 .828 .705
Distributions
on STS                              Tags            .680 .568 .857 .736 .731 .517
User Behavior
on STS

Conclusions &
                            Tags are also useful for deeper categorization (L2).
Outlook

Publications




           Arkaitz Zubiaga (UNED)                          PhD Thesis                  July 12th, 2011   47 / 98
Representing the Aggregation of Tags

 PhD Thesis          Classifier Committees
   Arkaitz
   Zubiaga
                            Despite the superiority of social tags, all data sources
Motivation                  perform well.
Selection of a
Classifier

STS &
                            Their outputs can be combined by using classifier
Datasets                    committees.
Representing
the
Aggregation of
Tags
                            Classifier committees add up margins (i.e., reliability
Tag                         values) outputted by several classifiers, and provide a
Distributions
on STS                      single combined prediction.
User Behavior
on STS

Conclusions &
                                                                 Cat. #1   Cat. #2       Cat. #3
Outlook
                         Classif. A                                1.2       1.1           0.6
Publications
                         Classif. B                                0.5       1.0           1.2
                         Classif. committees                       1.7       2.1           1.8

           Arkaitz Zubiaga (UNED)                         PhD Thesis                 July 12th, 2011   48 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Classifier Committees
   Arkaitz
   Zubiaga
                                                         Delicious         LThing        GReads
Motivation
                                                         L1    L2         L1   L2       L1   L2
Selection of a
Classifier                                                (17)      (243) (20)    (204) (20)       (204)
STS &                       Content (C)                  .610      .470   .807   .673   .807      .673
Datasets
                            Reviews (R)                  .646      .524   .828   .705   .828      .705
Representing
the                         Tags (T)                     .680      .568   .857   .736   .731      .517
Aggregation of
                              C+R                        .670      .547   .817   .704   .817      .704
                           Commit.



Tags

Tag
Distributions
                              C+T                        .696      .587   .821   .720   .832      .696
on STS                        R+T                        .694      .584   .859   .755   .857      .730
User Behavior
on STS
                              C+R+T                      .699      .588   .827   .732   .843      .727
Conclusions &
Outlook
                             Classifier committees successfully improve performance.
Publications


                                     Even on GoodReads, where tags were not good enough
                                     on their own.

           Arkaitz Zubiaga (UNED)                         PhD Thesis                    July 12th, 2011   49 / 98
Representing the Aggregation of Tags

 PhD Thesis          Results of Classifier Committees
   Arkaitz
   Zubiaga
                                                         Delicious         LThing        GReads
Motivation
                                                         L1    L2         L1   L2       L1   L2
Selection of a
Classifier                                                (17)      (243) (20)    (204) (20)       (204)
STS &                       Content (C)                  .610      .470   .807   .673   .807      .673
                            Reviews (R)
Datasets
                                                         .646      .524   .828   .705   .828      .705
Representing
the                         Tags (T)                     .680      .568   .857   .736   .731      .517
Aggregation of
                              C+R                        .670      .547   .817   .704   .817      .704
                           Commit.



Tags

Tag                           C+T                        .696      .587   .821   .720   .832      .696
Distributions
on STS                        R+T                        .694      .584   .859   .755   .857      .730
User Behavior                 C+R+T                      .699      .588   .827   .732   .843      .727
on STS

Conclusions &
Outlook
                             Data sources must be chosen with care:
Publications
                                     All 3 are helpful on Delicious.
                                     Content is harmful for books. Inappropriate considering
                                     synopses and ed. reviews as a summary of content?

           Arkaitz Zubiaga (UNED)                         PhD Thesis                    July 12th, 2011   50 / 98
Representing the Aggregation of Tags

 PhD Thesis          Summary of Results
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                            Better represent using all tags with TF weighting.
STS &
Datasets

Representing                Tags perform accurately even for deeper levels.
the
Aggregation of                      The system must encourage the user to tag to make it
Tags
                                    useful enough.
Tag
Distributions
on STS
                            Tags can be combined with other data to improve
User Behavior
on STS                      performance.
Conclusions &                       Combined data sources must be chosen with care.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis      July 12th, 2011   51 / 98
Tag Distributions on STS

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                       PhD Thesis   July 12th, 2011   52 / 98
Tag Distributions on STS

 PhD Thesis          Tag Distributions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier


                            So far, we have considered that tags annotated by the
STS &
Datasets

Representing                same number of users are equally representative to
the
Aggregation of              the resource.
Tags

Tag
Distributions
on STS
                            Distributions of tags in a collection could help
User Behavior
                            determine representativity of tags.
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                       PhD Thesis     July 12th, 2011   53 / 98
Tag Distributions on STS

 PhD Thesis          TF-IDF
   Arkaitz
   Zubiaga


Motivation

Selection of a       TF-IDF is an inverse weighting function (IWF) that
Classifier
                     computes:
STS &
Datasets                    the term frequency (TF).
Representing
the                         the inverse document frequency (IDF).
Aggregation of
Tags

Tag
                                                                          |D|
Distributions                              tf -idfij = tfij × log
on STS                                                               |{d : ti ∈ d}|
User Behavior
on STS                      High IDF value for terms appearing in few documents.
                            Low IDF value for terms appearing in many documents.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                       PhD Thesis                    July 12th, 2011   54 / 98
Tag Distributions on STS

 PhD Thesis          Tag Weighting Functions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   Analogous to TF-IDF on folksonomies:
STS &
Datasets                            TF-IRF → distributions across resources.
Representing
                                    TF-IUF → distributions across users.
the                                 TF-IBF → distributions across bookmarks.
Aggregation of
Tags

Tag
Distributions
                            TF-IRF and TF-IUF had been barely used, and their
on STS
                            suitability was yet unexplored.
User Behavior
on STS

Conclusions &               TF-IBF had not been used.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                        PhD Thesis       July 12th, 2011   55 / 98
Tag Distributions on STS

 PhD Thesis          Results Using IWFs
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                                              Delicious         LThing      GReads
STS &                                                 L1     L2        L1    L2    L1    L2
Datasets
                               TF                     .680 .568        .857 .736   .731 .517
Representing
the                                  TF-IRF           .639 .529        .894 .809   .799 .622
                              IWFs

Aggregation of
Tags                                 TF-IBF           .641 .532        .895 .811   .800 .628
Tag                                  TF-IUF           .661 .555        .892 .803   .794 .623
Distributions
on STS


                            All 3 IWFs clearly outperform TF for LibraryThing and
User Behavior
on STS

Conclusions &               GoodReads.
                                     Similar performance of IWFs.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                         PhD Thesis                July 12th, 2011   56 / 98
Tag Distributions on STS

 PhD Thesis          Results Using IWFs
   Arkaitz
   Zubiaga


Motivation

Selection of a                                         Delicious         LThing      GReads
Classifier

STS &
                                                      L1     L2        L1    L2    L1    L2
Datasets                       TF                     .680 .568        .857 .736   .731 .517
Representing
the
                                     TF-IRF           .639 .529        .894 .809   .799 .622
                              IWFs

Aggregation of
Tags
                                     TF-IBF           .641 .532        .895 .811   .800 .628
Tag                                  TF-IUF           .661 .555        .892 .803   .794 .623
Distributions
on STS

User Behavior
on STS
                            IWFs underperform on Delicious, due to tag
                            suggestions that make top tags utmost popular.
Conclusions &
Outlook                              IUF superior to IBF and IRF. Users who make their
Publications                         own choices make the difference.




           Arkaitz Zubiaga (UNED)                         PhD Thesis                July 12th, 2011   57 / 98
Tag Distributions on STS

 PhD Thesis          Using Classifier Committees with IWFs
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the                  How about using tags represented with IWFs on classifier
Aggregation of
Tags                 committees?
Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                       PhD Thesis   July 12th, 2011   58 / 98
Tag Distributions on STS

 PhD Thesis          Results Using IWF with Committees
   Arkaitz
   Zubiaga


Motivation

Selection of a                                         Delicious         LThing      GReads
Classifier

STS &
                                                      L1     L2        L1    L2    L1    L2
Datasets                       TF                     .699 .588        .859 .755   .857 .730
Representing
the
                                     TF-IRF           .697 .592        .885 .793   .864 .748
                              IWFs

Aggregation of
Tags
                                     TF-IBF           .698 .592        .887 .797   .866 .751
Tag
                                     TF-IUF           .700 .595        .885 .792   .864 .749
Distributions
on STS

User Behavior               IWF-based committes are even better than TF-based
                            ones.
on STS

Conclusions &
Outlook                              Even on Delicious, where IWFs were not appropriate,
Publications                         committees perform slightly better.




           Arkaitz Zubiaga (UNED)                         PhD Thesis                July 12th, 2011   59 / 98
Tag Distributions on STS

 PhD Thesis          Results Using IWF with Committees
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                                             Delicious         LThing      GReads
STS &                                                L1     L2        L1    L2    L1    L2
                               TF
Datasets
                                                     .699 .588        .859 .755   .857 .730
Representing
the                                  TF-IRF          .697 .592        .885 .793   .864 .748
                              IWFs

Aggregation of
Tags                                 TF-IBF          .698 .592        .887 .797   .866 .751
Tag                                  TF-IUF          .700 .595        .885 .792   .864 .749
Distributions
on STS

User Behavior
on STS                      Despite this outperformance of IWFs using committees,
Conclusions &               IWFs on their own perform better on LibraryThing
                            (.895 & .811).
Outlook

Publications




           Arkaitz Zubiaga (UNED)                        PhD Thesis                July 12th, 2011   60 / 98
Tag Distributions on STS

 PhD Thesis          Summary of Results Using IWFs
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                       IWFs are an appropriate way to weight tags when used
                            on classifier committees.
Datasets

Representing
the                                 The exception is LibraryThing, where tags on their own
Aggregation of
Tags                                perform better.
Tag
Distributions
on STS                      Combined data sources must be appropriately chosen
User Behavior               (e.g., synopses & ed. reviews are harmful with books).
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                        PhD Thesis        July 12th, 2011   61 / 98
User Behavior on STS

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                   PhD Thesis   July 12th, 2011   62 / 98
User Behavior on STS

 PhD Thesis          User Behavior: Categorizers and Describers
   Arkaitz
   Zubiaga



                              K¨rner1 suggested 2 kinds of user behavior:
Motivation
                               o
Selection of a
Classifier

STS &                                                                      Categorizer               Describer
Datasets                      Goal of Tagging                             later browsing           later retrieval
Representing
the
                              Change of Tag Vocabulary                         costly                  cheap
Aggregation of                Size of Tag Vocabulary                          limited                   open
Tags
                              Tags                                          subjective                objective
Tag
Distributions
on STS

User Behavior                 They found that Describers help infer semantic
on STS
                              relations among tags.
Conclusions &
Outlook                       Do these tagging behaviors affect the usefulness of tags
Publications                  for resource classification?


                         1
                             C. K¨rner. Understanding the Motivation behind Tagging. Hypertext 2009.
                                 o
           Arkaitz Zubiaga (UNED)                            PhD Thesis                                July 12th, 2011   63 / 98
User Behavior on STS

 PhD Thesis          Categorizers and Describers
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis   July 12th, 2011   64 / 98
User Behavior on STS

 PhD Thesis          Categorizers and Describers
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis   July 12th, 2011   65 / 98
User Behavior on STS

 PhD Thesis          Weighting Measures
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing         We use 3 measures to weight users, based on Koerner et al.
the
Aggregation of
                     (2010).
Tags

Tag
                            2 factors are considered: verbosity & diversity.
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis         July 12th, 2011   66 / 98
User Behavior on STS

 PhD Thesis          Weighting Measures: TPP
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                            Tags per Post (TPP) – Verbosity
Representing
the
Aggregation of                                                   r
Tags
                                                                      |Tur |
Tag                                                  TPP(u) =
Distributions                                                        |Ru |
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis                 July 12th, 2011   67 / 98
User Behavior on STS

 PhD Thesis          Weighting Measures: ORPHAN
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                            Orphan Ratio (ORPHAN) – Diversity
Representing
the                                                              |R(tmax )|
Aggregation of                                             n=
Tags                                                               100
Tag
Distributions                                                 o
                                                            |Tu | o
on STS
                                    ORPHAN(u) =                  , T = {t||R(t)| ≤ n}
User Behavior                                               |Tu | u
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis                July 12th, 2011   68 / 98
User Behavior on STS

 PhD Thesis          Weighting Measures: TRR
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing
                            Tag Resource Ratio (TRR) – Verbosity + Diversity
the
Aggregation of
Tags                                                                  |Tu |
                                                           TRR(u) =
Tag                                                                   |Ru |
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis                July 12th, 2011   69 / 98
User Behavior on STS

 PhD Thesis          Use of Weighting Measures
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier                   These 3 measures provide:
STS &
Datasets
                                    A weight for each user.
                                    Ranking of users according to each measure.
Representing
the
Aggregation of
Tags                        From rankings → subsets of users as extreme
Tag
Distributions
                            Categorizers (highest-ranked) and extreme Describers
on STS                      (lowest-ranked).
User Behavior
on STS

Conclusions &               Subsets range from 10% to 100% (step size = 10%).
Outlook

Publications




           Arkaitz Zubiaga (UNED)                      PhD Thesis          July 12th, 2011   70 / 98
User Behavior on STS

 PhD Thesis          Experiment Setup
   Arkaitz
   Zubiaga
                            We select subsets of users according to number of tag
Motivation                  assignments.
Selection of a
Classifier

STS &
                            Selecting by percents of users would be unfair →
Datasets                    different amounts of data.
Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis       July 12th, 2011   71 / 98
User Behavior on STS

 PhD Thesis          Experiments
   Arkaitz
   Zubiaga


Motivation
                     Classification
Selection of a              We use a multiclass SVM, with TF weighting of tags.
Classifier

STS &
                     Descriptivity
Datasets
                         Vectorial representations of resources:
                                    Tr → tag frequencies.
Representing
the
Aggregation of
Tags
                                    Rr → term frequencies on descriptive data
Tag
                                    (self-content).
Distributions
on STS

User Behavior               Cosine similarity between Tr and Rr :
on STS

Conclusions &                                  n
Outlook                                                               Tri × Rri
                               cos(θr ) =
                                                              n          2        n         2
Publications
                                              i=1             i=1 (Tri )     ×    i=1 (Rri )




           Arkaitz Zubiaga (UNED)                      PhD Thesis                        July 12th, 2011   72 / 98
User Behavior on STS

 PhD Thesis          Descriptivity Results
   Arkaitz
   Zubiaga                                 TPP (Verb.)         ORPHAN (Div.) TRR (V. + D.)
Motivation

Selection of a

                            Delicious
Classifier

STS &
Datasets

Representing
the
                            LibraryThing


Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
                            GoodReads




Outlook

Publications




           Arkaitz Zubiaga (UNED)                            PhD Thesis           July 12th, 2011   73 / 98
User Behavior on STS

 PhD Thesis          Classification Results
   Arkaitz
   Zubiaga                                 TPP (Verb.)         ORPHAN (Div.) TRR (V. + D.)
Motivation



                            Delicious
Selection of a
Classifier

STS &
Datasets

Representing
the
                            LibraryThing


Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
                            GoodReads




Outlook

Publications




           Arkaitz Zubiaga (UNED)                            PhD Thesis           July 12th, 2011   74 / 98
User Behavior on STS

 PhD Thesis          Classification Results
   Arkaitz
   Zubiaga                                 TPP (Verb.)         ORPHAN (Div.) TRR (V. + D.)
Motivation



                            Delicious
Selection of a
Classifier

STS &
Datasets

Representing
the
                            LibraryThing


Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
                            GoodReads




Outlook

Publications




           Arkaitz Zubiaga (UNED)                            PhD Thesis           July 12th, 2011   75 / 98
User Behavior on STS

 PhD Thesis          Overall Categorizers/Describers Results
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &

                            Discriminating by verbosity (TPP) does best for finding
Datasets

Representing
the                         extreme Categorizers.
Aggregation of
Tags

Tag
Distributions
                            The use of non-descriptive tags provide more accurate
on STS                      classification.
User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                   PhD Thesis      July 12th, 2011   76 / 98
Conclusions & Outlook

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                    PhD Thesis   July 12th, 2011   77 / 98
Conclusions & Outlook

 PhD Thesis          Contributions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets                    Generation & analysis of 3 large-scale social tagging
Representing                datasets.
the
Aggregation of

                            Release of some tagging datasets, used by Godoy and
Tags

Tag
Distributions               Amandi (2010), Strohmaier et al. (2010), Li et al. (2011),
on STS
                            and Ares et al. (2011).
User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis        July 12th, 2011   78 / 98
Conclusions & Outlook

 PhD Thesis          Contributions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                            First research work performing actual classification
STS &
Datasets                    experiments using social tags.
Representing                        Analysis of different representations of social tags.
the
Aggregation of                      Analysis of effect of tag distributions.
Tags
                                    Study of user behavior.
Tag
Distributions

                            It paves the way to future researchers interested in the
on STS

User Behavior
on STS                      task & in the exploration of STS.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                      PhD Thesis             July 12th, 2011   79 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier


                            Apart from the Problem Statement:
STS &
Datasets

Representing                        How can the annotations provided by users on social
the
Aggregation of                      tagging systems be exploited to improve the accuracy of
Tags                                a resource classification task?
Tag
Distributions
on STS
                            We set forth 10 research questions.
User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                      PhD Thesis           July 12th, 2011   80 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (1)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets

Representing                   RQ 1 What is a suitable SVM classifier for the task?
the
Aggregation of
Tags
                                             Native multiclass SVM >> Combinations of
Tag
Distributions                                binary SVMs.
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis        July 12th, 2011   81 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (2)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                          RQ 2 What is a suitable learning method for the
Datasets
                                    task?
Representing
the
Aggregation of
                                        Supervised Semi-supervised.
Tags

Tag
Distributions                                Unlike for binary tasks, where
on STS
                                             Semi-supervised >> Supervised (Joachims,
User Behavior
on STS                                       1999).
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis         July 12th, 2011   82 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (3)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                               RQ 3 How do the settings of STS affect
Representing
                                    folksonomies?
the
Aggregation of
Tags                                         Great impact of tag suggestions.
Tag
Distributions
on STS
                                             Importance of encouraging users to
User Behavior
on STS                                       annotate.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis         July 12th, 2011   83 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (4)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
                               RQ 4 How to amalgamate annotations to get a
Datasets                            representation of a resource?
Representing
the
Aggregation of
Tags
                                             Considering all the tags rather than only
Tag
                                             those in the top.
Distributions
on STS

User Behavior                                Weighting tags according to number of
on STS
                                             users annotating them.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis           July 12th, 2011   84 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (5)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &                          RQ 5 Is it worthwhile combining tags with other
Datasets
                                    data sources?
Representing
the
Aggregation of
Tags
                                             Combining different data sources helps
Tag
                                             improve performance.
Distributions
on STS

User Behavior                                Data sources must be appropriately
on STS
                                             chosen.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis         July 12th, 2011   85 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (6)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                               RQ 6 Are social tags specific enough to classify into
Representing
                                    narrower categories?
the
Aggregation of
Tags                                         Tags are as useful as for top level.
Tag
Distributions
on STS                                       Noll and Meinel (2008) → tags were
User Behavior
on STS
                                             probably not useful for deeper levels.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis           July 12th, 2011   86 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (7)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                               RQ 7 Can we consider tag distributions to get the
STS &
Datasets                            representativity of each tag?
Representing
the
Aggregation of                               LibraryThing & GoodReads: really useful.
Tags

Tag
Distributions                                Delicious: not useful, because of tag
                                             suggestions
on STS

User Behavior
on STS                                       → need of committees to make them
Conclusions &                                useful.
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis          July 12th, 2011   87 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (8)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
                               RQ 8 What approach to use to weigh the
Datasets                            representativity of tags?
Representing
the
Aggregation of
Tags
                                             LibraryThing & GoodReads: IBF, IRF &
Tag
                                             IUF are very similar.
Distributions
on STS

User Behavior                                Delicious: IUF clearly superior, because of
on STS
                                             users that get rid of suggestions.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis           July 12th, 2011   88 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (9)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
                               RQ 9 Can we discriminate users who further resemble
Datasets                            an expert classification?
Representing
the
Aggregation of
Tags
                                             Categorizers > Describers for
Tag
                                             classification.
Distributions
on STS

User Behavior                                Need of appropriate measure for
on STS
                                             discriminating.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis         July 12th, 2011   89 / 98
Conclusions & Outlook

 PhD Thesis          Research Questions (10)
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets
                             RQ 10 What features identify a Categorizer?
Representing
the
Aggregation of
                                             Categorizers can be found when
Tags                                         discriminating by verbosity.
Tag
Distributions
on STS
                                             Non-descriptive tags produce more
User Behavior
on STS                                       accurate classification.
Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis        July 12th, 2011   90 / 98
Conclusions & Outlook

 PhD Thesis          Future Directions
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier
                            Increase of interest in the field, still much work to do.
STS &
Datasets
                            We have considered each tag as a diferent token.
                            → Considering semantic meanings of social tags could
Representing
the
Aggregation of
Tags                        help.
Tag
Distributions
on STS                      Tag suggestions leverage several issues in folksonomies.
User Behavior               → Looking for a weighting function that fits the
on STS

Conclusions &
                            characteristics of systems with tag suggestions, e.g.,
Outlook                     Delicious.
Publications




           Arkaitz Zubiaga (UNED)                    PhD Thesis       July 12th, 2011   91 / 98
Publications

 PhD Thesis          Index
   Arkaitz
   Zubiaga
                     1   Motivation
Motivation

Selection of a
Classifier            2   Selection of a Classifier
STS &
Datasets
                     3   STS & Datasets
Representing
the
Aggregation of
Tags
                     4   Representing the Aggregation of Tags
Tag
Distributions        5   Tag Distributions on STS
on STS

User Behavior
on STS               6   User Behavior on STS
Conclusions &
Outlook
                     7   Conclusions & Outlook
Publications

                     8   Publications


           Arkaitz Zubiaga (UNED)                PhD Thesis     July 12th, 2011   92 / 98
Publications

 PhD Thesis          Publications
   Arkaitz
   Zubiaga


Motivation
                     Peer-Reviewed Conferences (I)
Selection of a              Arkaitz Zubiaga, Christian K¨rner, Markus Strohmaier.
                                                        o
Classifier
                            2011. Tags vs Shelves: From Social Tagging to Social
STS &
Datasets                    Classification. In Proceedings of Hypertext 2011, the
Representing                22nd ACM Conference on Hypertext and Hypermedia,
the
Aggregation of              Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)
Tags

Tag
Distributions
on STS
                            Arkaitz Zubiaga, Raquel Mart´ ınez, V´
                                                                 ıctor Fresno. 2009.
User Behavior
                            Getting the Most Out of Social Annotations for Web Page
on STS                      Classification. In Proceedings of DocEng 2009, the 9th
Conclusions &
Outlook
                            ACM Symposium on Document Engineering, pp. 74-83,
Publications                Munich, Germany. (acceptance rate: 16/54, 29.6%)
                            [15 citations]



           Arkaitz Zubiaga (UNED)                PhD Thesis          July 12th, 2011   93 / 98
Publications

 PhD Thesis          Publications
   Arkaitz
   Zubiaga


Motivation           Peer-Reviewed Conferences (II)
Selection of a
Classifier                   Arkaitz Zubiaga. 2009. Enhancing Navigation on
STS &                       Wikipedia with Social Tags. Wikimania 2009, Buenos
Datasets
                            Aires, Argentina.
Representing
the                         [6 citations]
Aggregation of
Tags

Tag                         Arkaitz Zubiaga, Alberto P. Garc´ ıa-Plaza, V´
                                                                         ıctor Fresno,
Distributions
on STS                      Raquel Mart´  ınez. 2009. Content-based Clustering for Tag
User Behavior               Cloud Visualization. In Proceedings of ASONAM 2009,
on STS

Conclusions &
                            International Conference on Advances in Social Networks
Outlook                     Analysis and Mining, pp. 316-319, Athens, Greece.
Publications                [3 citations]



           Arkaitz Zubiaga (UNED)                PhD Thesis             July 12th, 2011   94 / 98
Publications

 PhD Thesis          Publications
   Arkaitz
   Zubiaga
                     Journals (I)
Motivation
                            Arkaitz Zubiaga, Raquel Mart´
                                                        ınez, V´
                                                               ıctor Fresno. 2011.
Selection of a
Classifier                   Augmenting Web Page Classifiers with Social Annotations.
STS &                       Procesamiento del Lenguaje Natural. (acceptance rate:
Datasets                    33/60, 55%)
Representing
the
Aggregation of
Tags
                            Arkaitz Zubiaga, Raquel Mart´
                                                        ınez, V´
                                                               ıctor Fresno. 2009.
Tag
                            Clasificaci´n de P´ginas Web con Anotaciones Sociales.
                                      o      a
Distributions               Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233.
on STS
                            (acceptance rate: 36/72, 50%)
User Behavior
on STS

Conclusions &               Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
                                                                        ınez. 2009.
Outlook
                            Comparativa de Aproximaciones a SVM Semisupervisado
Publications
                            Multiclase para Clasificaci´n de P´ginas Web. Procesamiento
                                                      o       a
                            del Lenguaje Natural, vol. 42, pp. 63-70.


           Arkaitz Zubiaga (UNED)                 PhD Thesis            July 12th, 2011   95 / 98
Publications

 PhD Thesis          Publications
   Arkaitz
   Zubiaga


Motivation

Selection of a
Classifier

STS &
Datasets             Journals (II)
Representing
the                         Arkaitz Zubiaga, V´
                                              ıctor Fresno, Raquel Mart´
                                                                       ınez. Harnessing
Aggregation of
Tags
                            Folksonomies to Produce a Social Classification of Resources.
                            IEEE Transactions on Knowledge and Data Engineering.
Tag
Distributions               (pending notification)
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications




           Arkaitz Zubiaga (UNED)                 PhD Thesis              July 12th, 2011   96 / 98
Publications

 PhD Thesis          Publications
   Arkaitz
   Zubiaga


Motivation           Book Chapters
Selection of a
Classifier
                            Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
                                                                        ınez. 2011.
STS &                       Exploiting Social Annotations for Resource Classification.
Datasets
                            Social Network Mining, Analysis and Research
Representing
the                         Trends: Techniques and Applications. IGI Global.
Aggregation of
Tags                 Workshops
Tag
Distributions               Arkaitz Zubiaga, V´
                                              ıctor Fresno, Raquel Mart´ınez. 2009. Is
on STS
                            Unlabeled Data Suitable for Multiclass SVM-based Web
User Behavior
on STS                      Page Classification?. In Proceedings of the NAACL-HLT
Conclusions &               2009 Workshop on Semi-supervised Learning for
                            Natural Language Processing, pp. 28-36, Boulder, CO,
Outlook

Publications
                            United States.



           Arkaitz Zubiaga (UNED)                PhD Thesis            July 12th, 2011   97 / 98
Publications

 PhD Thesis          Thank You
   Arkaitz
   Zubiaga


Motivation

Selection of a
                        Achiu       Arigato                Danke Dhannvaad Dua Netjer en ek
Classifier
                        Efcharisto Gracias Gr`cies Gratia Grazie
                                             a
                     Guishepeli Hvala Kiitos K¨sz¨n¨m
                                                o o o
STS &
Datasets

Representing

                        Merc´ Merci Mila esker Obrigado
the
Aggregation of
Tags
                             e
Tag
Distributions        Shukran      Tack Tak Takk T¨anan Tapadh leat
                                      Shukriya
on STS

User Behavior
on STS
                     Tesekk¨r ederim Thank you Toda
                              u
Conclusions &
Outlook

Publications

                                          http://thesis.zubiaga.org/



           Arkaitz Zubiaga (UNED)                   PhD Thesis                 July 12th, 2011   98 / 98

More Related Content

Similar to Harnessing Folksonomies for Resource Classification

Policy Lunchbox - Digital Science
Policy Lunchbox - Digital SciencePolicy Lunchbox - Digital Science
Policy Lunchbox - Digital ScienceKaitlin Thaney
 
Sansone bio sharing introduction
Sansone bio sharing introductionSansone bio sharing introduction
Sansone bio sharing introductionMIBBI Checklists
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisStuart Wrigley
 
The Future of Digital Science - World Science Forum 2011
The Future of Digital Science - World Science Forum 2011The Future of Digital Science - World Science Forum 2011
The Future of Digital Science - World Science Forum 2011Kaitlin Thaney
 
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...Christian Körner
 
Open Access in 2012 – a developing country perspective
Open Access in 2012 – a developing country perspectiveOpen Access in 2012 – a developing country perspective
Open Access in 2012 – a developing country perspectiveEve Gray
 
NIH BD2K DataMed data index - DATS model
NIH BD2K DataMed data index - DATS modelNIH BD2K DataMed data index - DATS model
NIH BD2K DataMed data index - DATS modelSusanna-Assunta Sansone
 
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...Yasuhiro Murayama
 
Repository Federation: Towards Data Interoperability
Repository Federation: Towards Data InteroperabilityRepository Federation: Towards Data Interoperability
Repository Federation: Towards Data InteroperabilityRobert H. McDonald
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identificationguest453b14
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identificationguest453b14
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identificationguest453b14
 
Dataset citation and identification
Dataset citation and identificationDataset citation and identification
Dataset citation and identificationAdam Farquhar
 
Advancing the fourth paradigm of research: Assimilating repositories into act...
Advancing the fourth paradigm of research: Assimilating repositories into act...Advancing the fourth paradigm of research: Assimilating repositories into act...
Advancing the fourth paradigm of research: Assimilating repositories into act...Tyler Walters
 
Metadata challenges research and re-usable data - BioSharing, ISA and STATO
Metadata challenges research and re-usable data - BioSharing, ISA and STATOMetadata challenges research and re-usable data - BioSharing, ISA and STATO
Metadata challenges research and re-usable data - BioSharing, ISA and STATOAlejandra Gonzalez-Beltran
 
GARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceGARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceDavid Johnson
 
Metadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data RepositoriesMetadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data Repositoriesandrea huang
 

Similar to Harnessing Folksonomies for Resource Classification (20)

Policy Lunchbox - Digital Science
Policy Lunchbox - Digital SciencePolicy Lunchbox - Digital Science
Policy Lunchbox - Digital Science
 
Sansone bio sharing introduction
Sansone bio sharing introductionSansone bio sharing introduction
Sansone bio sharing introduction
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log Analysis
 
The Future of Digital Science - World Science Forum 2011
The Future of Digital Science - World Science Forum 2011The Future of Digital Science - World Science Forum 2011
The Future of Digital Science - World Science Forum 2011
 
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...
Of Categorizers and Describers: An Evaluation of Quantitative Measures for Ta...
 
Open Access in 2012 – a developing country perspective
Open Access in 2012 – a developing country perspectiveOpen Access in 2012 – a developing country perspective
Open Access in 2012 – a developing country perspective
 
NIH BD2K DataMed data index - DATS model
NIH BD2K DataMed data index - DATS modelNIH BD2K DataMed data index - DATS model
NIH BD2K DataMed data index - DATS model
 
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...
Slides by Y. Murayama at Japan-France Joint Meeting on Open Access and Open D...
 
Repository Federation: Towards Data Interoperability
Repository Federation: Towards Data InteroperabilityRepository Federation: Towards Data Interoperability
Repository Federation: Towards Data Interoperability
 
ISA - a short overview - Dec 2013
ISA - a short overview - Dec 2013ISA - a short overview - Dec 2013
ISA - a short overview - Dec 2013
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identification
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identification
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identification
 
Dataset citation and identification
Dataset citation and identificationDataset citation and identification
Dataset citation and identification
 
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
 
Advancing the fourth paradigm of research: Assimilating repositories into act...
Advancing the fourth paradigm of research: Assimilating repositories into act...Advancing the fourth paradigm of research: Assimilating repositories into act...
Advancing the fourth paradigm of research: Assimilating repositories into act...
 
Metadata challenges research and re-usable data - BioSharing, ISA and STATO
Metadata challenges research and re-usable data - BioSharing, ISA and STATOMetadata challenges research and re-usable data - BioSharing, ISA and STATO
Metadata challenges research and re-usable data - BioSharing, ISA and STATO
 
GARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceGARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant Science
 
Metadata
MetadataMetadata
Metadata
 
Metadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data RepositoriesMetadata as Linked Data for Research Data Repositories
Metadata as Linked Data for Research Data Repositories
 

More from azubiaga

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaazubiaga
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Mediaazubiaga
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discoveryazubiaga
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...azubiaga
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Socialesazubiaga
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualizationazubiaga
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?azubiaga
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentationazubiaga
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classificationazubiaga
 

More from azubiaga (9)

Exploiting context for rumour detection in social media
Exploiting context for rumour detection in social mediaExploiting context for rumour detection in social media
Exploiting context for rumour detection in social media
 
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social MediaCrowdsourcing the Annotation of Rumourous Conversations in Social Media
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
 
Mining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information DiscoveryMining Twitter for Real-Time Trend and Information Discovery
Mining Twitter for Real-Time Trend and Information Discovery
 
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
Newspaper Editors vs the Crowd: On the Appropriateness of Front Page News Sel...
 
Clasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones SocialesClasificación de Páginas Web con Anotaciones Sociales
Clasificación de Páginas Web con Anotaciones Sociales
 
Content-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud VisualizationContent-based Clustering for Tag Cloud Visualization
Content-based Clustering for Tag Cloud Visualization
 
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
 
Master thesis presentation
Master thesis presentationMaster thesis presentation
Master thesis presentation
 
Tags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social ClassificationTags vs Shelves: From Social Tagging to Social Classification
Tags vs Shelves: From Social Tagging to Social Classification
 

Recently uploaded

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Harnessing Folksonomies for Resource Classification

  • 1. Harnessing Folksonomies for Resource Classification PhD Thesis Arkaitz Zubiaga UNED July 12th, 2011 Advisors: Raquel Mart´ınez Unanue V´ ıctor Fresno Fern´ndez a
  • 2. PhD Thesis Table of Contents Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 2 / 98
  • 3. Motivation PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 3 / 98
  • 4. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 4 / 98
  • 5. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 5 / 98
  • 6. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 6 / 98
  • 7. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier Classifying resources is a common task. STS & Datasets Web pages, books, movies, files,... Representing the Aggregation of Large collections of resources → expensive & effortful Tags to classify manually. Tag Distributions LoC reported an average cost of $94.58 for cataloging on STS each book in 2002. User Behavior on STS Conclusions & Enormous costs and efforts → automatic classification. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 7 / 98
  • 8. Motivation PhD Thesis Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier Representation of resources → self-content. STS & Datasets Representing Use of self-content of resources presents some issues: the Aggregation of Not always representative enough. Tags Not always accessible (e.g., books). Tag Distributions on STS User Behavior Social tags provided by users → alternative to solve the on STS problem. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 8 / 98
  • 9. Motivation PhD Thesis Tagging Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications T1, T2, T3 = sets of tags. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 9 / 98
  • 10. Motivation PhD Thesis Social Tagging Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Aggregation of user annotations → folksonomy. Folksonomy: Folk (People) + Taxis (Classification) + Nomos (Management). Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 10 / 98
  • 11. Motivation PhD Thesis Organization of Resources Arkaitz Zubiaga Motivation Selection of a User annotations → own organization of resources. Classifier STS & Datasets A user’s tags Representing the Tag # Resources Aggregation of Tags research 82 Tag twitter 28 Distributions on STS web2.0 35 User Behavior language 42 on STS english 64 Conclusions & Outlook ... ... Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 11 / 98
  • 12. Motivation PhD Thesis Example of Bookmarks Arkaitz Zubiaga Motivation User Resource Tags Selection of a 1 user1 flickr.com photo, web2.0, social Classifier 2 user2 flickr.com photography, images STS & Datasets 3 user1 google.com searchengine Representing 4 user3 twitter.com microblogging, twitter the Aggregation of Tags Tag Bookmark: (1) user ui ∈ U who annotates (2) resource rj ∈ R being annotated Distributions on STS User Behavior (3) tags Tij = {t1 , ..., tn } ∈ T utilized. on STS Conclusions & Outlook Publications bij : ui × rj × Tij Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 12 / 98
  • 13. Motivation PhD Thesis Sum of Annotations Arkaitz Zubiaga Motivation Top tags (79,681 users) Selection of a Classifier Tag Rank Tag User Count STS & 1 photos 22,712 Datasets 2 flickr 19,046 Representing the 3 photography 15,968 Aggregation of Tags 4 photo 15,225 Tag 5 sharing 10,648 Distributions on STS 6 images 9,637 User Behavior 7 web2.0 9,528 on STS Conclusions & 8 community 4,571 Outlook 9 social 3,798 Publications 10 pictures 3,115 Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 13 / 98
  • 14. Motivation PhD Thesis Tag-based Resource Classification Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 14 / 98
  • 15. Motivation PhD Thesis Problem Statement Arkaitz Zubiaga Motivation How can the annotations provided by users on social tagging Selection of a systems be exploited to improve the accuracy of a resource Classifier classification task? STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 15 / 98
  • 16. Motivation PhD Thesis Related Work Arkaitz Zubiaga Motivation Selection of a Classifier STS & Social tags for information management: Datasets Search: Bao et al. (2007) & Heymann et al. (2008). Representing the Recommender Systems: Shepitsen et al. (2008) & Li Aggregation of Tags Tag et al. (2008). Distributions on STS User Behavior on STS Enhanced Browsing: Smith (2008). Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 16 / 98
  • 17. Motivation PhD Thesis Related Work Arkaitz Zubiaga Motivation Selection of a Classifier Classification: Noll and Meinel (2008) → statistical STS & Datasets analysis of matches between tags & taxonomies. Representing Tags are useful for broad categorization. the Not for narrower categorization. Aggregation of Tags Tag Distributions Lack of further research with: on STS Actual classification experiments. User Behavior Other types of resources. on STS Different representations of social tags. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 17 / 98
  • 18. Selection of a Classifier PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 18 / 98
  • 19. Selection of a Classifier PhD Thesis Characteristics of the task Arkaitz Zubiaga Motivation We have: Selection of a Classifier Large set of resources: some labeled + many unlabeled. STS & Multiclass taxonomy. Datasets Representing the Automated classifiers learn a model from labeled Aggregation of resources. Tags Tag This model is used to classify unlabeled resources Distributions afterward. on STS User Behavior on STS 2 learning settings: Conclusions & Outlook Supervised: only labeled resources considered for learning. Semi-supervised: unlabeled resources are also taken into Publications account. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 19 / 98
  • 20. Selection of a Classifier PhD Thesis Support Vector Machines (SVM) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Hyperplane that separates with largest margin. Outlook Publications Use of kernels → redimensions the space. Resource/Hyperplane margin → Classifier’s reliability. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 20 / 98
  • 21. Selection of a Classifier PhD Thesis Selection of a Classifier Arkaitz Zubiaga Motivation Selection of a Classifier SVMs solve binary problems by default. STS & Datasets Representing To solve multiclass tasks: the Aggregation of Native multiclass classifier (mSVM). Tags Combining binary classifiers: Tag one-against-all (oaaSVM). Distributions on STS one-against-one (oaoSVM). User Behavior on STS Conclusions & Both supervised (s) and semi-supervised (ss). Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 21 / 98
  • 22. Selection of a Classifier PhD Thesis Experiment Settings Arkaitz Zubiaga Motivation Selection of a 3 benchmark datasets to analyze suitability of Classifier classifiers: STS & Datasets Representing Dataset # web pages # trainset # categories BankSearch the Aggregation of 10,000 3,000 10 WebKB Tags 4,518 1,000 6 Tag Distributions Y! Science 788 100 6 on STS User Behavior on STS We present accuracy to show performance. Conclusions & Outlook Publications We perform 6 runs, and show the average accuracy. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 22 / 98
  • 23. Selection of a Classifier PhD Thesis Results Arkaitz Zubiaga Motivation BankSearch WebKB Y! Science Selection of a mSVM (s) .925 .810 .825 Classifier mSVM (ss) .923 .778 .836 STS & Datasets oaaSVM (s) .843 .776 .536 Representing the oaaSVM (ss) .842 .773 .565 Aggregation of Tags oaoSVM (s) .826 .775 .483 Tag oaoSVM (ss) .811 .754 .514 Distributions on STS User Behavior Native multiclass classifier performs best, while on STS Conclusions & supervised semi-supervised. Outlook Publications We used the supervised approach, as it is computationally less expensive. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 23 / 98
  • 24. STS & Datasets PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 24 / 98
  • 25. STS & Datasets PhD Thesis Requirements Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Selected STS should have: Representing Large communities involved. the Aggregation of Public access to data. Tags Consolidated taxonomies as a ground truth. Tag Distributions We chose Delicious, LibraryThing & GoodReads. on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 25 / 98
  • 26. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 26 / 98
  • 27. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 27 / 98
  • 28. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 28 / 98
  • 29. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 29 / 98
  • 30. STS & Datasets PhD Thesis Characteristics of STS Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LibraryThing GoodReads Datasets Resources web documents books books Representing the Tag suggestions based on earlier no based on earlier Aggregation of bookmarks on the tags utilized by Tags resource the user Tag Distributions Tag insertion space-separated comma-separated one by one text- on STS box User Behavior Saving a resource prompts user to prompts user to user needs to on STS add tags add tags at sec- click again to add Conclusions & ond step tags Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 30 / 98
  • 31. STS & Datasets PhD Thesis Retrieval of Categorized Resources Arkaitz Zubiaga Motivation Selection of a Classifier Retrieval of popular annotated resources, which were also STS & categorized by experts. Datasets Representing the Aggregation of Top level (L1) Second level (L2) Tags Tag Resources Classes Resources Classes Distributions on STS Web ODP 12,616 17 12,286 243 DDC 27,299 10 27,040 99 User Behavior Books on STS LCC 24,861 20 23,565 204 Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 31 / 98
  • 32. STS & Datasets PhD Thesis Retrieval of Additional User Annotations Arkaitz Zubiaga Motivation Selection of a Classifier Delicious: 300,571,231 bookmarks STS & → 273,478,137 annotated (91.00%) Datasets LibraryThing: 44,612,784 bookmarks Representing the Aggregation of Tags → 22,343,427 annotated (50.08%) Tag Distributions on STS GoodReads: 47,302,861 bookmarks User Behavior → 9,323,539 annotated (19.71%) on STS Conclusions & Importance of system’s encouragement to tagging Outlook Publications resources. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 32 / 98
  • 33. STS & Datasets PhD Thesis Tag Popularity on Resources Arkaitz Zubiaga Motivation Selection of a 1 Classifier 0.9 STS & Delicious Datasets Representing 0.8 LibraryThing the Aggregation of 0.7 GoodReads Average usage Tags 0.6 Tag Distributions 0.5 on STS User Behavior 0.4 on STS 0.3 Conclusions & Outlook 0.2 Publications 0.1 0 Tag rank on resources Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 33 / 98
  • 34. STS & Datasets PhD Thesis Tag Novelty in Bookmarks by Rank Arkaitz Zubiaga 100% Motivation Selection of a Classifier 90% Delicious STS & 80% LibraryThing Datasets 70% Representing the Aggregation of 60% % of novelty Tags 50% Tag Distributions 40% on STS User Behavior 30% on STS Conclusions & 20% Outlook 10% Publications 0% 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 Bookmark rank Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 34 / 98
  • 35. STS & Datasets PhD Thesis Retrieval of Additional Data Arkaitz Zubiaga Motivation Selection of a Classifier URLs: STS & Datasets Self-content, by crawling URLs. Representing User reviews (Delicious & StumbleUpon). the Aggregation of Tags Books: Tag Distributions Self-content (unavailable): on STS Synopses (Barnes&Noble). User Behavior Editorial reviews (Amazon). on STS User reviews (LibraryThing, GoodReads & Amazon). Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 35 / 98
  • 36. STS & Datasets PhD Thesis Summary of the Analysis of Datasets Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Few users annotate resources when the system does not encourage to do it. Representing the Aggregation of Tags Tag Resource-based tag suggestions → Repeated use of Distributions on STS popular tags. User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 36 / 98
  • 37. Representing the Aggregation of Tags PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 37 / 98
  • 38. Representing the Aggregation of Tags PhD Thesis Representing Resources Using Tags Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Different ways to aggregate user annotations on a Representing vectorial representation. the Aggregation of Tags Tag 2 major factors to consider: Distributions What tags to use? on STS How to weigh those tags? User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 38 / 98
  • 39. Representing the Aggregation of Tags PhD Thesis Representing Resources Using Tags Arkaitz Zubiaga Use of all tags (FTA), or just top 10 tags for each Motivation resource. Selection of a Classifier 4 different weightings. STS & Datasets Example of a resource (100 users): Representing t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1) the Aggregation of Tags Tag FTA Top 10 Distributions on STS User Behavior t1 t2 t3 ... t9 t10 ... tn on STS Ranks 1 0.9 0.8 ... 0.2 0.1 ... 0 Conclusions & Outlook Fractions 0.5 0.3 0.2 ... 0.02 0.01 ... 0.01 Publications Binary 1 1 1 ... 1 1 ... 1 TF 50 30 20 ... 2 1 ... 1 Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 39 / 98
  • 40. Representing the Aggregation of Tags PhD Thesis Representing Resources Using Other Data Arkaitz Sources Zubiaga Motivation Selection of a Classifier STS & To represent resources using content and reviews: Datasets Representing 1 Removal of HTML tags. the Aggregation of Tags 2 Removal of stopwords. Tag Distributions on STS 3 Stem of remaining words. User Behavior on STS Conclusions & 4 TF-IDF weighting of words. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 40 / 98
  • 41. Representing the Aggregation of Tags PhD Thesis Experiment Setup Arkaitz Zubiaga Motivation Selection of a Classifier Multiclass SVMs. STS & Datasets Representing Show the average accuracy of 6 runs. the Aggregation of Tags For clarity of presentation, we limit results to: Tag Distributions LCC taxonomy for books. on STS Training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for User Behavior test). Training sets of 18,000 books (8,861 (L1)/5,565 (L2) for on STS Conclusions & Outlook test). Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 41 / 98
  • 42. Representing the Aggregation of Tags PhD Thesis Experiment Setup Arkaitz Zubiaga Motivation Selection of a Classifier Compared Representations STS & Self-content (baseline). Datasets Representing the Reviews. Aggregation of Tags Tag Tags: Distributions on STS Ranks (Top 10). User Behavior Fractions (Top 10 & FTA). on STS Binary (Top 10 & FTA). Conclusions & Outlook TF (Top 10 & FTA). Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 42 / 98
  • 43. Representing the Aggregation of Tags PhD Thesis Results of Tags vs Other Data Sources Arkaitz Zubiaga Motivation Delicious LThing GReads Selection of a L1 L2 L1 L2 L1 L2 Classifier (17) (243) (20) (204) (20) (204) STS & Datasets Content .610 .470 .807 .673 .807 .673 Representing Reviews .646 .524 .828 .705 .828 .705 Ranks the Aggregation of .484 .360 .795 .511 .630 .405 Fractions (10) Tags .464 .349 .738 .411 .663 .427 Tag Fractions (FTA) .461 .336 .712 .409 .654 .432 Tags Distributions Binary (10) on STS .531 .361 .770 .550 .623 .422 User Behavior on STS Binary (FTA) .572 .529 .655 .606 .639 .481 Conclusions & TF (10) .654 .545 .855 .722 .713 .491 Outlook Publications TF (FTA) .680 .568 .857 .736 .731 .517 Usually, FTA > 10. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 43 / 98
  • 44. Representing the Aggregation of Tags PhD Thesis Results of Tags vs Other Data Sources Arkaitz Zubiaga Motivation Delicious LThing GReads Selection of a L1 L2 L1 L2 L1 L2 Classifier (17) (243) (20) (204) (20) (204) STS & Datasets Content .610 .470 .807 .673 .807 .673 Representing Reviews .646 .524 .828 .705 .828 .705 Ranks the Aggregation of .484 .360 .795 .511 .630 .405 Tags Fractions (10) .464 .349 .738 .411 .663 .427 Fractions (FTA) Tag .461 .336 .712 .409 .654 .432 Tags Distributions Binary (10) on STS .531 .361 .770 .550 .623 .422 User Behavior on STS Binary (FTA) .572 .529 .655 .606 .639 .481 Conclusions & TF (10) .654 .545 .855 .722 .713 .491 Outlook TF (FTA) .680 .568 .857 .736 .731 .517 Publications TF (FTA) is the best approach for tags. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 44 / 98
  • 45. Representing the Aggregation of Tags PhD Thesis Results of Tags vs Other Data Sources Arkaitz Zubiaga Motivation Selection of a Classifier Delicious LThing GReads STS & Datasets L1 L2 L1 L2 L1 L2 Representing (17) (243) (20) (204) (20) (204) the Aggregation of Content .610 .470 .807 .673 .807 .673 Tags Reviews .646 .524 .828 .705 .828 .705 Tag Distributions Tags .680 .568 .857 .736 .731 .517 on STS User Behavior on STS Tags clearly outperform content and reviews on Conclusions & Outlook Delicious and LibraryThing. Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 45 / 98
  • 46. Representing the Aggregation of Tags PhD Thesis Results of Tags vs Other Data Sources Arkaitz Zubiaga Motivation Selection of a Classifier Delicious LThing GReads STS & Datasets L1 L2 L1 L2 L1 L2 Representing (17) (243) (20) (204) (20) (204) the Aggregation of Content .610 .470 .807 .673 .807 .673 Tags Reviews .646 .524 .828 .705 .828 .705 Tag Distributions Tags .680 .568 .857 .736 .731 .517 on STS User Behavior on STS GoodReads’ disencouragement to tagging makes it Conclusions & Outlook insufficient to outperform content and reviews. Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 46 / 98
  • 47. Representing the Aggregation of Tags PhD Thesis Results of Tags vs Other Data Sources Arkaitz Zubiaga Motivation Selection of a Classifier STS & Delicious LThing GReads Datasets L1 L2 L1 L2 L1 L2 Representing the (17) (243) (20) (204) (20) (204) Aggregation of Tags Content .610 .470 .807 .673 .807 .673 Tag Reviews .646 .524 .828 .705 .828 .705 Distributions on STS Tags .680 .568 .857 .736 .731 .517 User Behavior on STS Conclusions & Tags are also useful for deeper categorization (L2). Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 47 / 98
  • 48. Representing the Aggregation of Tags PhD Thesis Classifier Committees Arkaitz Zubiaga Despite the superiority of social tags, all data sources Motivation perform well. Selection of a Classifier STS & Their outputs can be combined by using classifier Datasets committees. Representing the Aggregation of Tags Classifier committees add up margins (i.e., reliability Tag values) outputted by several classifiers, and provide a Distributions on STS single combined prediction. User Behavior on STS Conclusions & Cat. #1 Cat. #2 Cat. #3 Outlook Classif. A 1.2 1.1 0.6 Publications Classif. B 0.5 1.0 1.2 Classif. committees 1.7 2.1 1.8 Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 48 / 98
  • 49. Representing the Aggregation of Tags PhD Thesis Results of Classifier Committees Arkaitz Zubiaga Delicious LThing GReads Motivation L1 L2 L1 L2 L1 L2 Selection of a Classifier (17) (243) (20) (204) (20) (204) STS & Content (C) .610 .470 .807 .673 .807 .673 Datasets Reviews (R) .646 .524 .828 .705 .828 .705 Representing the Tags (T) .680 .568 .857 .736 .731 .517 Aggregation of C+R .670 .547 .817 .704 .817 .704 Commit. Tags Tag Distributions C+T .696 .587 .821 .720 .832 .696 on STS R+T .694 .584 .859 .755 .857 .730 User Behavior on STS C+R+T .699 .588 .827 .732 .843 .727 Conclusions & Outlook Classifier committees successfully improve performance. Publications Even on GoodReads, where tags were not good enough on their own. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 49 / 98
  • 50. Representing the Aggregation of Tags PhD Thesis Results of Classifier Committees Arkaitz Zubiaga Delicious LThing GReads Motivation L1 L2 L1 L2 L1 L2 Selection of a Classifier (17) (243) (20) (204) (20) (204) STS & Content (C) .610 .470 .807 .673 .807 .673 Reviews (R) Datasets .646 .524 .828 .705 .828 .705 Representing the Tags (T) .680 .568 .857 .736 .731 .517 Aggregation of C+R .670 .547 .817 .704 .817 .704 Commit. Tags Tag C+T .696 .587 .821 .720 .832 .696 Distributions on STS R+T .694 .584 .859 .755 .857 .730 User Behavior C+R+T .699 .588 .827 .732 .843 .727 on STS Conclusions & Outlook Data sources must be chosen with care: Publications All 3 are helpful on Delicious. Content is harmful for books. Inappropriate considering synopses and ed. reviews as a summary of content? Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 50 / 98
  • 51. Representing the Aggregation of Tags PhD Thesis Summary of Results Arkaitz Zubiaga Motivation Selection of a Classifier Better represent using all tags with TF weighting. STS & Datasets Representing Tags perform accurately even for deeper levels. the Aggregation of The system must encourage the user to tag to make it Tags useful enough. Tag Distributions on STS Tags can be combined with other data to improve User Behavior on STS performance. Conclusions & Combined data sources must be chosen with care. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 51 / 98
  • 52. Tag Distributions on STS PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 52 / 98
  • 53. Tag Distributions on STS PhD Thesis Tag Distributions Arkaitz Zubiaga Motivation Selection of a Classifier So far, we have considered that tags annotated by the STS & Datasets Representing same number of users are equally representative to the Aggregation of the resource. Tags Tag Distributions on STS Distributions of tags in a collection could help User Behavior determine representativity of tags. on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 53 / 98
  • 54. Tag Distributions on STS PhD Thesis TF-IDF Arkaitz Zubiaga Motivation Selection of a TF-IDF is an inverse weighting function (IWF) that Classifier computes: STS & Datasets the term frequency (TF). Representing the the inverse document frequency (IDF). Aggregation of Tags Tag |D| Distributions tf -idfij = tfij × log on STS |{d : ti ∈ d}| User Behavior on STS High IDF value for terms appearing in few documents. Low IDF value for terms appearing in many documents. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 54 / 98
  • 55. Tag Distributions on STS PhD Thesis Tag Weighting Functions Arkaitz Zubiaga Motivation Selection of a Classifier Analogous to TF-IDF on folksonomies: STS & Datasets TF-IRF → distributions across resources. Representing TF-IUF → distributions across users. the TF-IBF → distributions across bookmarks. Aggregation of Tags Tag Distributions TF-IRF and TF-IUF had been barely used, and their on STS suitability was yet unexplored. User Behavior on STS Conclusions & TF-IBF had not been used. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 55 / 98
  • 56. Tag Distributions on STS PhD Thesis Results Using IWFs Arkaitz Zubiaga Motivation Selection of a Classifier Delicious LThing GReads STS & L1 L2 L1 L2 L1 L2 Datasets TF .680 .568 .857 .736 .731 .517 Representing the TF-IRF .639 .529 .894 .809 .799 .622 IWFs Aggregation of Tags TF-IBF .641 .532 .895 .811 .800 .628 Tag TF-IUF .661 .555 .892 .803 .794 .623 Distributions on STS All 3 IWFs clearly outperform TF for LibraryThing and User Behavior on STS Conclusions & GoodReads. Similar performance of IWFs. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 56 / 98
  • 57. Tag Distributions on STS PhD Thesis Results Using IWFs Arkaitz Zubiaga Motivation Selection of a Delicious LThing GReads Classifier STS & L1 L2 L1 L2 L1 L2 Datasets TF .680 .568 .857 .736 .731 .517 Representing the TF-IRF .639 .529 .894 .809 .799 .622 IWFs Aggregation of Tags TF-IBF .641 .532 .895 .811 .800 .628 Tag TF-IUF .661 .555 .892 .803 .794 .623 Distributions on STS User Behavior on STS IWFs underperform on Delicious, due to tag suggestions that make top tags utmost popular. Conclusions & Outlook IUF superior to IBF and IRF. Users who make their Publications own choices make the difference. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 57 / 98
  • 58. Tag Distributions on STS PhD Thesis Using Classifier Committees with IWFs Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the How about using tags represented with IWFs on classifier Aggregation of Tags committees? Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 58 / 98
  • 59. Tag Distributions on STS PhD Thesis Results Using IWF with Committees Arkaitz Zubiaga Motivation Selection of a Delicious LThing GReads Classifier STS & L1 L2 L1 L2 L1 L2 Datasets TF .699 .588 .859 .755 .857 .730 Representing the TF-IRF .697 .592 .885 .793 .864 .748 IWFs Aggregation of Tags TF-IBF .698 .592 .887 .797 .866 .751 Tag TF-IUF .700 .595 .885 .792 .864 .749 Distributions on STS User Behavior IWF-based committes are even better than TF-based ones. on STS Conclusions & Outlook Even on Delicious, where IWFs were not appropriate, Publications committees perform slightly better. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 59 / 98
  • 60. Tag Distributions on STS PhD Thesis Results Using IWF with Committees Arkaitz Zubiaga Motivation Selection of a Classifier Delicious LThing GReads STS & L1 L2 L1 L2 L1 L2 TF Datasets .699 .588 .859 .755 .857 .730 Representing the TF-IRF .697 .592 .885 .793 .864 .748 IWFs Aggregation of Tags TF-IBF .698 .592 .887 .797 .866 .751 Tag TF-IUF .700 .595 .885 .792 .864 .749 Distributions on STS User Behavior on STS Despite this outperformance of IWFs using committees, Conclusions & IWFs on their own perform better on LibraryThing (.895 & .811). Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 60 / 98
  • 61. Tag Distributions on STS PhD Thesis Summary of Results Using IWFs Arkaitz Zubiaga Motivation Selection of a Classifier STS & IWFs are an appropriate way to weight tags when used on classifier committees. Datasets Representing the The exception is LibraryThing, where tags on their own Aggregation of Tags perform better. Tag Distributions on STS Combined data sources must be appropriately chosen User Behavior (e.g., synopses & ed. reviews are harmful with books). on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 61 / 98
  • 62. User Behavior on STS PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 62 / 98
  • 63. User Behavior on STS PhD Thesis User Behavior: Categorizers and Describers Arkaitz Zubiaga K¨rner1 suggested 2 kinds of user behavior: Motivation o Selection of a Classifier STS & Categorizer Describer Datasets Goal of Tagging later browsing later retrieval Representing the Change of Tag Vocabulary costly cheap Aggregation of Size of Tag Vocabulary limited open Tags Tags subjective objective Tag Distributions on STS User Behavior They found that Describers help infer semantic on STS relations among tags. Conclusions & Outlook Do these tagging behaviors affect the usefulness of tags Publications for resource classification? 1 C. K¨rner. Understanding the Motivation behind Tagging. Hypertext 2009. o Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 63 / 98
  • 64. User Behavior on STS PhD Thesis Categorizers and Describers Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 64 / 98
  • 65. User Behavior on STS PhD Thesis Categorizers and Describers Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 65 / 98
  • 66. User Behavior on STS PhD Thesis Weighting Measures Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing We use 3 measures to weight users, based on Koerner et al. the Aggregation of (2010). Tags Tag 2 factors are considered: verbosity & diversity. Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 66 / 98
  • 67. User Behavior on STS PhD Thesis Weighting Measures: TPP Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Tags per Post (TPP) – Verbosity Representing the Aggregation of r Tags |Tur | Tag TPP(u) = Distributions |Ru | on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 67 / 98
  • 68. User Behavior on STS PhD Thesis Weighting Measures: ORPHAN Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Orphan Ratio (ORPHAN) – Diversity Representing the |R(tmax )| Aggregation of n= Tags 100 Tag Distributions o |Tu | o on STS ORPHAN(u) = , T = {t||R(t)| ≤ n} User Behavior |Tu | u on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 68 / 98
  • 69. User Behavior on STS PhD Thesis Weighting Measures: TRR Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing Tag Resource Ratio (TRR) – Verbosity + Diversity the Aggregation of Tags |Tu | TRR(u) = Tag |Ru | Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 69 / 98
  • 70. User Behavior on STS PhD Thesis Use of Weighting Measures Arkaitz Zubiaga Motivation Selection of a Classifier These 3 measures provide: STS & Datasets A weight for each user. Ranking of users according to each measure. Representing the Aggregation of Tags From rankings → subsets of users as extreme Tag Distributions Categorizers (highest-ranked) and extreme Describers on STS (lowest-ranked). User Behavior on STS Conclusions & Subsets range from 10% to 100% (step size = 10%). Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 70 / 98
  • 71. User Behavior on STS PhD Thesis Experiment Setup Arkaitz Zubiaga We select subsets of users according to number of tag Motivation assignments. Selection of a Classifier STS & Selecting by percents of users would be unfair → Datasets different amounts of data. Representing the Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 71 / 98
  • 72. User Behavior on STS PhD Thesis Experiments Arkaitz Zubiaga Motivation Classification Selection of a We use a multiclass SVM, with TF weighting of tags. Classifier STS & Descriptivity Datasets Vectorial representations of resources: Tr → tag frequencies. Representing the Aggregation of Tags Rr → term frequencies on descriptive data Tag (self-content). Distributions on STS User Behavior Cosine similarity between Tr and Rr : on STS Conclusions & n Outlook Tri × Rri cos(θr ) = n 2 n 2 Publications i=1 i=1 (Tri ) × i=1 (Rri ) Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 72 / 98
  • 73. User Behavior on STS PhD Thesis Descriptivity Results Arkaitz Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.) Motivation Selection of a Delicious Classifier STS & Datasets Representing the LibraryThing Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & GoodReads Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 73 / 98
  • 74. User Behavior on STS PhD Thesis Classification Results Arkaitz Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.) Motivation Delicious Selection of a Classifier STS & Datasets Representing the LibraryThing Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & GoodReads Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 74 / 98
  • 75. User Behavior on STS PhD Thesis Classification Results Arkaitz Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.) Motivation Delicious Selection of a Classifier STS & Datasets Representing the LibraryThing Aggregation of Tags Tag Distributions on STS User Behavior on STS Conclusions & GoodReads Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 75 / 98
  • 76. User Behavior on STS PhD Thesis Overall Categorizers/Describers Results Arkaitz Zubiaga Motivation Selection of a Classifier STS & Discriminating by verbosity (TPP) does best for finding Datasets Representing the extreme Categorizers. Aggregation of Tags Tag Distributions The use of non-descriptive tags provide more accurate on STS classification. User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 76 / 98
  • 77. Conclusions & Outlook PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 77 / 98
  • 78. Conclusions & Outlook PhD Thesis Contributions Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Generation & analysis of 3 large-scale social tagging Representing datasets. the Aggregation of Release of some tagging datasets, used by Godoy and Tags Tag Distributions Amandi (2010), Strohmaier et al. (2010), Li et al. (2011), on STS and Ares et al. (2011). User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 78 / 98
  • 79. Conclusions & Outlook PhD Thesis Contributions Arkaitz Zubiaga Motivation Selection of a Classifier First research work performing actual classification STS & Datasets experiments using social tags. Representing Analysis of different representations of social tags. the Aggregation of Analysis of effect of tag distributions. Tags Study of user behavior. Tag Distributions It paves the way to future researchers interested in the on STS User Behavior on STS task & in the exploration of STS. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 79 / 98
  • 80. Conclusions & Outlook PhD Thesis Research Questions Arkaitz Zubiaga Motivation Selection of a Classifier Apart from the Problem Statement: STS & Datasets Representing How can the annotations provided by users on social the Aggregation of tagging systems be exploited to improve the accuracy of Tags a resource classification task? Tag Distributions on STS We set forth 10 research questions. User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 80 / 98
  • 81. Conclusions & Outlook PhD Thesis Research Questions (1) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Representing RQ 1 What is a suitable SVM classifier for the task? the Aggregation of Tags Native multiclass SVM >> Combinations of Tag Distributions binary SVMs. on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 81 / 98
  • 82. Conclusions & Outlook PhD Thesis Research Questions (2) Arkaitz Zubiaga Motivation Selection of a Classifier STS & RQ 2 What is a suitable learning method for the Datasets task? Representing the Aggregation of Supervised Semi-supervised. Tags Tag Distributions Unlike for binary tasks, where on STS Semi-supervised >> Supervised (Joachims, User Behavior on STS 1999). Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 82 / 98
  • 83. Conclusions & Outlook PhD Thesis Research Questions (3) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets RQ 3 How do the settings of STS affect Representing folksonomies? the Aggregation of Tags Great impact of tag suggestions. Tag Distributions on STS Importance of encouraging users to User Behavior on STS annotate. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 83 / 98
  • 84. Conclusions & Outlook PhD Thesis Research Questions (4) Arkaitz Zubiaga Motivation Selection of a Classifier STS & RQ 4 How to amalgamate annotations to get a Datasets representation of a resource? Representing the Aggregation of Tags Considering all the tags rather than only Tag those in the top. Distributions on STS User Behavior Weighting tags according to number of on STS users annotating them. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 84 / 98
  • 85. Conclusions & Outlook PhD Thesis Research Questions (5) Arkaitz Zubiaga Motivation Selection of a Classifier STS & RQ 5 Is it worthwhile combining tags with other Datasets data sources? Representing the Aggregation of Tags Combining different data sources helps Tag improve performance. Distributions on STS User Behavior Data sources must be appropriately on STS chosen. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 85 / 98
  • 86. Conclusions & Outlook PhD Thesis Research Questions (6) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets RQ 6 Are social tags specific enough to classify into Representing narrower categories? the Aggregation of Tags Tags are as useful as for top level. Tag Distributions on STS Noll and Meinel (2008) → tags were User Behavior on STS probably not useful for deeper levels. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 86 / 98
  • 87. Conclusions & Outlook PhD Thesis Research Questions (7) Arkaitz Zubiaga Motivation Selection of a Classifier RQ 7 Can we consider tag distributions to get the STS & Datasets representativity of each tag? Representing the Aggregation of LibraryThing & GoodReads: really useful. Tags Tag Distributions Delicious: not useful, because of tag suggestions on STS User Behavior on STS → need of committees to make them Conclusions & useful. Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 87 / 98
  • 88. Conclusions & Outlook PhD Thesis Research Questions (8) Arkaitz Zubiaga Motivation Selection of a Classifier STS & RQ 8 What approach to use to weigh the Datasets representativity of tags? Representing the Aggregation of Tags LibraryThing & GoodReads: IBF, IRF & Tag IUF are very similar. Distributions on STS User Behavior Delicious: IUF clearly superior, because of on STS users that get rid of suggestions. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 88 / 98
  • 89. Conclusions & Outlook PhD Thesis Research Questions (9) Arkaitz Zubiaga Motivation Selection of a Classifier STS & RQ 9 Can we discriminate users who further resemble Datasets an expert classification? Representing the Aggregation of Tags Categorizers > Describers for Tag classification. Distributions on STS User Behavior Need of appropriate measure for on STS discriminating. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 89 / 98
  • 90. Conclusions & Outlook PhD Thesis Research Questions (10) Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets RQ 10 What features identify a Categorizer? Representing the Aggregation of Categorizers can be found when Tags discriminating by verbosity. Tag Distributions on STS Non-descriptive tags produce more User Behavior on STS accurate classification. Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 90 / 98
  • 91. Conclusions & Outlook PhD Thesis Future Directions Arkaitz Zubiaga Motivation Selection of a Classifier Increase of interest in the field, still much work to do. STS & Datasets We have considered each tag as a diferent token. → Considering semantic meanings of social tags could Representing the Aggregation of Tags help. Tag Distributions on STS Tag suggestions leverage several issues in folksonomies. User Behavior → Looking for a weighting function that fits the on STS Conclusions & characteristics of systems with tag suggestions, e.g., Outlook Delicious. Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 91 / 98
  • 92. Publications PhD Thesis Index Arkaitz Zubiaga 1 Motivation Motivation Selection of a Classifier 2 Selection of a Classifier STS & Datasets 3 STS & Datasets Representing the Aggregation of Tags 4 Representing the Aggregation of Tags Tag Distributions 5 Tag Distributions on STS on STS User Behavior on STS 6 User Behavior on STS Conclusions & Outlook 7 Conclusions & Outlook Publications 8 Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 92 / 98
  • 93. Publications PhD Thesis Publications Arkaitz Zubiaga Motivation Peer-Reviewed Conferences (I) Selection of a Arkaitz Zubiaga, Christian K¨rner, Markus Strohmaier. o Classifier 2011. Tags vs Shelves: From Social Tagging to Social STS & Datasets Classification. In Proceedings of Hypertext 2011, the Representing 22nd ACM Conference on Hypertext and Hypermedia, the Aggregation of Eindhoven, Netherlands. (acceptance rate: 35/104, 34%) Tags Tag Distributions on STS Arkaitz Zubiaga, Raquel Mart´ ınez, V´ ıctor Fresno. 2009. User Behavior Getting the Most Out of Social Annotations for Web Page on STS Classification. In Proceedings of DocEng 2009, the 9th Conclusions & Outlook ACM Symposium on Document Engineering, pp. 74-83, Publications Munich, Germany. (acceptance rate: 16/54, 29.6%) [15 citations] Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 93 / 98
  • 94. Publications PhD Thesis Publications Arkaitz Zubiaga Motivation Peer-Reviewed Conferences (II) Selection of a Classifier Arkaitz Zubiaga. 2009. Enhancing Navigation on STS & Wikipedia with Social Tags. Wikimania 2009, Buenos Datasets Aires, Argentina. Representing the [6 citations] Aggregation of Tags Tag Arkaitz Zubiaga, Alberto P. Garc´ ıa-Plaza, V´ ıctor Fresno, Distributions on STS Raquel Mart´ ınez. 2009. Content-based Clustering for Tag User Behavior Cloud Visualization. In Proceedings of ASONAM 2009, on STS Conclusions & International Conference on Advances in Social Networks Outlook Analysis and Mining, pp. 316-319, Athens, Greece. Publications [3 citations] Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 94 / 98
  • 95. Publications PhD Thesis Publications Arkaitz Zubiaga Journals (I) Motivation Arkaitz Zubiaga, Raquel Mart´ ınez, V´ ıctor Fresno. 2011. Selection of a Classifier Augmenting Web Page Classifiers with Social Annotations. STS & Procesamiento del Lenguaje Natural. (acceptance rate: Datasets 33/60, 55%) Representing the Aggregation of Tags Arkaitz Zubiaga, Raquel Mart´ ınez, V´ ıctor Fresno. 2009. Tag Clasificaci´n de P´ginas Web con Anotaciones Sociales. o a Distributions Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233. on STS (acceptance rate: 36/72, 50%) User Behavior on STS Conclusions & Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´ ınez. 2009. Outlook Comparativa de Aproximaciones a SVM Semisupervisado Publications Multiclase para Clasificaci´n de P´ginas Web. Procesamiento o a del Lenguaje Natural, vol. 42, pp. 63-70. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 95 / 98
  • 96. Publications PhD Thesis Publications Arkaitz Zubiaga Motivation Selection of a Classifier STS & Datasets Journals (II) Representing the Arkaitz Zubiaga, V´ ıctor Fresno, Raquel Mart´ ınez. Harnessing Aggregation of Tags Folksonomies to Produce a Social Classification of Resources. IEEE Transactions on Knowledge and Data Engineering. Tag Distributions (pending notification) on STS User Behavior on STS Conclusions & Outlook Publications Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 96 / 98
  • 97. Publications PhD Thesis Publications Arkaitz Zubiaga Motivation Book Chapters Selection of a Classifier Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´ ınez. 2011. STS & Exploiting Social Annotations for Resource Classification. Datasets Social Network Mining, Analysis and Research Representing the Trends: Techniques and Applications. IGI Global. Aggregation of Tags Workshops Tag Distributions Arkaitz Zubiaga, V´ ıctor Fresno, Raquel Mart´ınez. 2009. Is on STS Unlabeled Data Suitable for Multiclass SVM-based Web User Behavior on STS Page Classification?. In Proceedings of the NAACL-HLT Conclusions & 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, Outlook Publications United States. Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 97 / 98
  • 98. Publications PhD Thesis Thank You Arkaitz Zubiaga Motivation Selection of a Achiu Arigato Danke Dhannvaad Dua Netjer en ek Classifier Efcharisto Gracias Gr`cies Gratia Grazie a Guishepeli Hvala Kiitos K¨sz¨n¨m o o o STS & Datasets Representing Merc´ Merci Mila esker Obrigado the Aggregation of Tags e Tag Distributions Shukran Tack Tak Takk T¨anan Tapadh leat Shukriya on STS User Behavior on STS Tesekk¨r ederim Thank you Toda u Conclusions & Outlook Publications http://thesis.zubiaga.org/ Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 98 / 98