Harnessing Folksonomies for Resource Classification

Harnessing Folksonomies for Resource
Classiﬁcation
PhD Thesis

Arkaitz Zubiaga

UNED

July 12th, 2011

Advisors:
Raquel Mart´ınez Unanue
V´
ıctor Fresno Fern´ndez
a

PhD Thesis Table of Contents
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
Classiﬁer 2 Selection of a Classiﬁer
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS

User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications

8 Publications

Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 2 / 98

Motivation

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications


Motivation

PhD Thesis Resource Classiﬁcation
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


Motivation

Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


Motivation

Arkaitz
Zubiaga

Motivation

Selection of a
Classifier Classifying resources is a common task.
STS &
Datasets
Web pages, books, movies, files,...
Representing
the
Aggregation of
Large collections of resources → expensive & effortful
Tags to classify manually.
Tag
Distributions
LoC reported an average cost of $94.58 for cataloging
on STS each book in 2002.
User Behavior
on STS

Conclusions &
Enormous costs and efforts → automatic classification.
Outlook

Publications


Motivation

Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer
Representation of resources → self-content.
STS &
Datasets

Representing Use of self-content of resources presents some issues:
the
Aggregation of Not always representative enough.
Tags
Not always accessible (e.g., books).
Tag
Distributions
on STS

User Behavior
Social tags provided by users → alternative to solve the
on STS problem.
Conclusions &
Outlook

Publications


Motivation

PhD Thesis Tagging
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications

T1, T2, T3 = sets of tags.


Motivation

PhD Thesis Social Tagging
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications Aggregation of user annotations → folksonomy.
Folksonomy: Folk (People) + Taxis (Classiﬁcation) +
Nomos (Management).


Motivation

PhD Thesis Organization of Resources
Arkaitz
Zubiaga

Motivation

Selection of a User annotations → own organization of resources.
Classiﬁer

STS &
Datasets A user’s tags
Representing
the Tag # Resources
Aggregation of
Tags research 82
Tag twitter 28
Distributions
on STS web2.0 35
User Behavior language 42
on STS
english 64
Conclusions &
Outlook ... ...
Publications


Motivation

PhD Thesis Example of Bookmarks
Arkaitz
Zubiaga

Motivation User Resource Tags
Selection of a 1 user1 flickr.com photo, web2.0, social
Classifier
2 user2 flickr.com photography, images
STS &
Datasets 3 user1 google.com searchengine
Representing 4 user3 twitter.com microblogging, twitter
the
Aggregation of
Tags

Tag Bookmark: (1) user ui ∈ U who annotates
(2) resource rj ∈ R being annotated
Distributions
on STS

User Behavior (3) tags Tij = {t1 , ..., tn } ∈ T utilized.
on STS

Conclusions &
Outlook

Publications

bij : ui × rj × Tij


Motivation

PhD Thesis Sum of Annotations
Arkaitz
Zubiaga

Motivation Top tags (79,681 users)
Selection of a
Classiﬁer
Tag Rank Tag User Count
STS & 1 photos 22,712
Datasets
2 flickr 19,046
Representing
the 3 photography 15,968
Aggregation of
Tags 4 photo 15,225
Tag 5 sharing 10,648
Distributions
on STS 6 images 9,637
User Behavior 7 web2.0 9,528
on STS

Conclusions &
8 community 4,571
Outlook 9 social 3,798
Publications
10 pictures 3,115


Motivation

PhD Thesis Tag-based Resource Classiﬁcation
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


Motivation

PhD Thesis Problem Statement
Arkaitz
Zubiaga

Motivation
How can the annotations provided by users on social tagging
Selection of a
systems be exploited to improve the accuracy of a resource
Classiﬁer
classiﬁcation task?
STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


Motivation

PhD Thesis Related Work
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Social tags for information management:
Datasets
Search: Bao et al. (2007) & Heymann et al. (2008).
Representing
the

Recommender Systems: Shepitsen et al. (2008) & Li
Aggregation of
Tags

Tag et al. (2008).
Distributions
on STS

User Behavior
on STS
Enhanced Browsing: Smith (2008).
Conclusions &
Outlook

Publications


Motivation

PhD Thesis Related Work
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier Classification: Noll and Meinel (2008) → statistical
STS &
Datasets
analysis of matches between tags & taxonomies.
Representing
Tags are useful for broad categorization.
the Not for narrower categorization.
Aggregation of
Tags

Tag
Distributions
Lack of further research with:
on STS Actual classification experiments.
User Behavior Other types of resources.
on STS
Different representations of social tags.
Conclusions &
Outlook

Publications


Selection of a Classiﬁer

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications



PhD Thesis Characteristics of the task
Arkaitz
Zubiaga

Motivation We have:
Selection of a
Classiﬁer
Large set of resources: some labeled + many unlabeled.
STS &
Multiclass taxonomy.
Datasets

Representing
the
Automated classiﬁers learn a model from labeled
Aggregation of resources.
Tags

Tag
This model is used to classify unlabeled resources
Distributions afterward.
on STS

User Behavior
on STS 2 learning settings:
Conclusions &
Outlook
Supervised: only labeled resources considered for learning.
Semi-supervised: unlabeled resources are also taken into
Publications
account.



PhD Thesis Support Vector Machines (SVM)
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions & Hyperplane that separates with largest margin.
Outlook

Publications
Use of kernels → redimensions the space.

Resource/Hyperplane margin → Classiﬁer’s reliability.


PhD Thesis Selection of a Classifier
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier SVMs solve binary problems by default.
STS &
Datasets

Representing To solve multiclass tasks:
the
Aggregation of Native multiclass classifier (mSVM).
Tags Combining binary classifiers:
Tag one-against-all (oaaSVM).
Distributions
on STS one-against-one (oaoSVM).
User Behavior
on STS

Conclusions &
Both supervised (s) and semi-supervised (ss).
Outlook

Publications



PhD Thesis Experiment Settings
Arkaitz
Zubiaga

Motivation

Selection of a
3 benchmark datasets to analyze suitability of
Classiﬁer classiﬁers:
STS &
Datasets

Representing Dataset # web pages # trainset # categories
BankSearch
the
Aggregation of 10,000 3,000 10
WebKB
Tags
4,518 1,000 6
Tag
Distributions Y! Science 788 100 6
on STS

User Behavior
on STS
We present accuracy to show performance.
Conclusions &
Outlook

Publications We perform 6 runs, and show the average accuracy.



PhD Thesis Results
Arkaitz
Zubiaga

Motivation BankSearch WebKB Y! Science
Selection of a mSVM (s) .925 .810 .825
Classiﬁer
mSVM (ss) .923 .778 .836
STS &
Datasets oaaSVM (s) .843 .776 .536
Representing
the
oaaSVM (ss) .842 .773 .565
Aggregation of
Tags
oaoSVM (s) .826 .775 .483
Tag
oaoSVM (ss) .811 .754 .514
Distributions
on STS

User Behavior Native multiclass classiﬁer performs best, while
on STS

Conclusions &
supervised semi-supervised.
Outlook

Publications
We used the supervised approach, as it is
computationally less expensive.


STS & Datasets

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications


STS & Datasets

PhD Thesis Requirements
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets Selected STS should have:
Representing Large communities involved.
the
Aggregation of Public access to data.
Tags
Consolidated taxonomies as a ground truth.
Tag
Distributions

We chose Delicious, LibraryThing & GoodReads.
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


STS & Datasets

PhD Thesis Characteristics of STS
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook

Publications


STS & Datasets

Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

Datasets
Representing
Tags
resource the user
Tag
on STS box
on STS
Outlook

Publications


STS & Datasets

PhD Thesis Retrieval of Categorized Resources
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer Retrieval of popular annotated resources, which were also
STS & categorized by experts.
Datasets

Representing
the
Aggregation of Top level (L1) Second level (L2)
Tags

Tag
Resources Classes Resources Classes
Distributions
on STS
Web ODP 12,616 17 12,286 243
DDC 27,299 10 27,040 99
User Behavior
Books
on STS
LCC 24,861 20 23,565 204
Conclusions &
Outlook

Publications


STS & Datasets

PhD Thesis Retrieval of Additional User Annotations
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer
Delicious: 300,571,231 bookmarks
STS &
→ 273,478,137 annotated (91.00%)
Datasets

LibraryThing: 44,612,784 bookmarks
Representing
the
Aggregation of
Tags → 22,343,427 annotated (50.08%)
Tag
Distributions
on STS GoodReads: 47,302,861 bookmarks
User Behavior → 9,323,539 annotated (19.71%)
on STS

Conclusions & Importance of system’s encouragement to tagging
Outlook

Publications
resources.


STS & Datasets

PhD Thesis Tag Popularity on Resources
Arkaitz
Zubiaga

Motivation

Selection of a 1
Classiﬁer
0.9
STS & Delicious
Datasets

Representing
0.8
LibraryThing
the
Aggregation of
0.7 GoodReads
Average usage

Tags
0.6
Tag
Distributions 0.5
on STS

User Behavior 0.4
on STS
0.3
Conclusions &
Outlook
0.2
Publications
0.1

0

Tag rank on resources

STS & Datasets

PhD Thesis Tag Novelty in Bookmarks by Rank
Arkaitz
Zubiaga

100%
Motivation

Selection of a
Classiﬁer
90% Delicious
STS & 80% LibraryThing
Datasets
70%
Representing
the
Aggregation of
60%
% of novelty

Tags
50%
Tag
Distributions
40%
on STS

User Behavior 30%
on STS

Conclusions & 20%
Outlook
10%
Publications
0%
3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97

Bookmark rank


STS & Datasets

PhD Thesis Retrieval of Additional Data
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer URLs:
STS &
Datasets
Self-content, by crawling URLs.
Representing
User reviews (Delicious & StumbleUpon).
the
Aggregation of
Tags Books:
Tag
Distributions
Self-content (unavailable):
on STS Synopses (Barnes&Noble).
User Behavior Editorial reviews (Amazon).
on STS
User reviews (LibraryThing, GoodReads & Amazon).
Conclusions &
Outlook

Publications


STS & Datasets

PhD Thesis Summary of the Analysis of Datasets
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
Few users annotate resources when the system does
not encourage to do it.
Representing
the
Aggregation of
Tags

Tag Resource-based tag suggestions → Repeated use of
Distributions
on STS popular tags.
User Behavior
on STS

Conclusions &
Outlook

Publications


Representing the Aggregation of Tags

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications



PhD Thesis Representing Resources Using Tags
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets Diﬀerent ways to aggregate user annotations on a
Representing vectorial representation.
the
Aggregation of
Tags

Tag
2 major factors to consider:
Distributions What tags to use?
on STS
How to weigh those tags?
User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Representing Resources Using Tags
Arkaitz
Zubiaga
Use of all tags (FTA), or just top 10 tags for each
Motivation
resource.
Selection of a
Classiﬁer 4 diﬀerent weightings.
STS &
Datasets Example of a resource (100 users):
Representing t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1)
the
Aggregation of
Tags

Tag FTA
Top 10
Distributions
on STS

User Behavior t1 t2 t3 ... t9 t10 ... tn
on STS
Ranks 1 0.9 0.8 ... 0.2 0.1 ... 0
Conclusions &
Outlook Fractions 0.5 0.3 0.2 ... 0.02 0.01 ... 0.01
Publications Binary 1 1 1 ... 1 1 ... 1
TF 50 30 20 ... 2 1 ... 1



PhD Thesis Representing Resources Using Other Data
Arkaitz Sources
Zubiaga

Motivation

Selection of a
Classiﬁer

STS & To represent resources using content and reviews:
Datasets

Representing
1 Removal of HTML tags.
the
Aggregation of
Tags 2 Removal of stopwords.
Tag
Distributions
on STS 3 Stem of remaining words.
User Behavior
on STS

Conclusions &
4 TF-IDF weighting of words.
Outlook

Publications



PhD Thesis Experiment Setup
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer Multiclass SVMs.
STS &
Datasets

Representing
Show the average accuracy of 6 runs.
the
Aggregation of
Tags For clarity of presentation, we limit results to:
Tag
Distributions
LCC taxonomy for books.
on STS Training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for
User Behavior test).
Training sets of 18,000 books (8,861 (L1)/5,565 (L2) for
on STS

Conclusions &
Outlook test).
Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer
Compared Representations
STS & Self-content (baseline).
Datasets

Representing
the Reviews.
Aggregation of
Tags

Tag Tags:
Distributions
on STS Ranks (Top 10).
User Behavior Fractions (Top 10 & FTA).
on STS
Binary (Top 10 & FTA).
Conclusions &
Outlook TF (Top 10 & FTA).
Publications



PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga

Motivation
Delicious LThing GReads
Selection of a
L1 L2 L1 L2 L1 L2
Classiﬁer
(17) (243) (20) (204) (20) (204)
STS &
Datasets Content .610 .470 .807 .673 .807 .673
Representing Reviews .646 .524 .828 .705 .828 .705
Ranks
the
Aggregation of .484 .360 .795 .511 .630 .405
Fractions (10)
Tags
.464 .349 .738 .411 .663 .427
Tag
Fractions (FTA) .461 .336 .712 .409 .654 .432
Tags

Distributions

Binary (10)
on STS
.531 .361 .770 .550 .623 .422
User Behavior
on STS Binary (FTA) .572 .529 .655 .606 .639 .481
Conclusions & TF (10) .654 .545 .855 .722 .713 .491
Outlook

Publications
TF (FTA) .680 .568 .857 .736 .731 .517

Usually, FTA > 10.



Arkaitz
Zubiaga

Motivation
Selection of a
L1 L2 L1 L2 L1 L2
Classiﬁer (17) (243) (20) (204) (20) (204)
STS &
Datasets Content .610 .470 .807 .673 .807 .673
Representing Reviews .646 .524 .828 .705 .828 .705
Ranks
the
Aggregation of .484 .360 .795 .511 .630 .405
Tags
Fractions (10) .464 .349 .738 .411 .663 .427
Fractions (FTA)
Tag
.461 .336 .712 .409 .654 .432
Tags

Distributions

Binary (10)
on STS
.531 .361 .770 .550 .623 .422
User Behavior
on STS Binary (FTA) .572 .529 .655 .606 .639 .481
Conclusions & TF (10) .654 .545 .855 .722 .713 .491
Outlook
TF (FTA) .680 .568 .857 .736 .731 .517
Publications

TF (FTA) is the best approach for tags.



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer Delicious LThing GReads
STS &
Datasets L1 L2 L1 L2 L1 L2
Representing (17) (243) (20) (204) (20) (204)
the
Aggregation of Content .610 .470 .807 .673 .807 .673
Tags
Reviews .646 .524 .828 .705 .828 .705
Tag
Distributions Tags .680 .568 .857 .736 .731 .517
on STS

User Behavior
on STS
Tags clearly outperform content and reviews on
Conclusions &
Outlook Delicious and LibraryThing.
Publications



Arkaitz
Zubiaga

Motivation

Selection of a
STS &
Representing (17) (243) (20) (204) (20) (204)
the
Aggregation of Content .610 .470 .807 .673 .807 .673
Tags
Reviews .646 .524 .828 .705 .828 .705
Tag
Distributions Tags .680 .568 .857 .736 .731 .517
on STS

User Behavior
on STS
GoodReads’ disencouragement to tagging makes it
Conclusions &
Outlook insuﬃcient to outperform content and reviews.
Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Representing
the
(17) (243) (20) (204) (20) (204)
Aggregation of
Tags
Content .610 .470 .807 .673 .807 .673
Tag Reviews .646 .524 .828 .705 .828 .705
Distributions
on STS Tags .680 .568 .857 .736 .731 .517
User Behavior
on STS

Conclusions &
Tags are also useful for deeper categorization (L2).
Outlook

Publications



PhD Thesis Classifier Committees
Arkaitz
Zubiaga
Despite the superiority of social tags, all data sources
Motivation perform well.
Selection of a
Classifier

STS &
Their outputs can be combined by using classifier
Datasets committees.
Representing
the
Aggregation of
Tags
Classifier committees add up margins (i.e., reliability
Tag values) outputted by several classifiers, and provide a
Distributions
on STS single combined prediction.
User Behavior
on STS

Conclusions &
Cat. #1 Cat. #2 Cat. #3
Outlook
Classif. A 1.2 1.1 0.6
Publications
Classif. B 0.5 1.0 1.2
Classif. committees 1.7 2.1 1.8



PhD Thesis Results of Classifier Committees
Arkaitz
Zubiaga
Motivation
L1 L2 L1 L2 L1 L2
Selection of a
Classifier (17) (243) (20) (204) (20) (204)
STS & Content (C) .610 .470 .807 .673 .807 .673
Datasets
Reviews (R) .646 .524 .828 .705 .828 .705
Representing
the Tags (T) .680 .568 .857 .736 .731 .517
Aggregation of
C+R .670 .547 .817 .704 .817 .704
Commit.

Tags

Tag
Distributions
C+T .696 .587 .821 .720 .832 .696
on STS R+T .694 .584 .859 .755 .857 .730
User Behavior
on STS
C+R+T .699 .588 .827 .732 .843 .727
Conclusions &
Outlook
Classifier committees successfully improve performance.
Publications

Even on GoodReads, where tags were not good enough
on their own.



PhD Thesis Results of Classiﬁer Committees
Arkaitz
Zubiaga
Motivation
L1 L2 L1 L2 L1 L2
Selection of a
Classiﬁer (17) (243) (20) (204) (20) (204)
STS & Content (C) .610 .470 .807 .673 .807 .673
Reviews (R)
Datasets
.646 .524 .828 .705 .828 .705
Representing
the Tags (T) .680 .568 .857 .736 .731 .517
Aggregation of
C+R .670 .547 .817 .704 .817 .704
Commit.

Tags

Tag C+T .696 .587 .821 .720 .832 .696
Distributions
on STS R+T .694 .584 .859 .755 .857 .730
User Behavior C+R+T .699 .588 .827 .732 .843 .727
on STS

Conclusions &
Outlook
Data sources must be chosen with care:
Publications
All 3 are helpful on Delicious.
Content is harmful for books. Inappropriate considering
synopses and ed. reviews as a summary of content?



PhD Thesis Summary of Results
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer
Better represent using all tags with TF weighting.
STS &
Datasets

Representing Tags perform accurately even for deeper levels.
the
Aggregation of The system must encourage the user to tag to make it
Tags
useful enough.
Tag
Distributions
on STS
Tags can be combined with other data to improve
User Behavior
on STS performance.
Conclusions & Combined data sources must be chosen with care.
Outlook

Publications


Tag Distributions on STS

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications



PhD Thesis Tag Distributions
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

So far, we have considered that tags annotated by the
STS &
Datasets

Representing same number of users are equally representative to
the
Aggregation of the resource.
Tags

Tag
Distributions
on STS
Distributions of tags in a collection could help
User Behavior
determine representativity of tags.
on STS

Conclusions &
Outlook

Publications



PhD Thesis TF-IDF
Arkaitz
Zubiaga

Motivation

Selection of a TF-IDF is an inverse weighting function (IWF) that
Classiﬁer
computes:
STS &
Datasets the term frequency (TF).
Representing
the the inverse document frequency (IDF).
Aggregation of
Tags

Tag
|D|
Distributions tf -idfij = tfij × log
on STS |{d : ti ∈ d}|
User Behavior
on STS High IDF value for terms appearing in few documents.
Low IDF value for terms appearing in many documents.
Conclusions &
Outlook

Publications



PhD Thesis Tag Weighting Functions
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer Analogous to TF-IDF on folksonomies:
STS &
Datasets TF-IRF → distributions across resources.
Representing
TF-IUF → distributions across users.
the TF-IBF → distributions across bookmarks.
Aggregation of
Tags

Tag
Distributions
TF-IRF and TF-IUF had been barely used, and their
on STS
suitability was yet unexplored.
User Behavior
on STS

Conclusions & TF-IBF had not been used.
Outlook

Publications



PhD Thesis Results Using IWFs
Arkaitz
Zubiaga

Motivation

Selection of a
STS & L1 L2 L1 L2 L1 L2
Datasets
TF .680 .568 .857 .736 .731 .517
Representing
the TF-IRF .639 .529 .894 .809 .799 .622
IWFs

Aggregation of
Tags TF-IBF .641 .532 .895 .811 .800 .628
Tag TF-IUF .661 .555 .892 .803 .794 .623
Distributions
on STS

All 3 IWFs clearly outperform TF for LibraryThing and
User Behavior
on STS

Conclusions & GoodReads.
Similar performance of IWFs.
Outlook

Publications



PhD Thesis Results Using IWFs
Arkaitz
Zubiaga

Motivation

Selection of a Delicious LThing GReads
Classiﬁer

STS &
L1 L2 L1 L2 L1 L2
Datasets TF .680 .568 .857 .736 .731 .517
Representing
the
TF-IRF .639 .529 .894 .809 .799 .622
IWFs

Aggregation of
Tags
TF-IBF .641 .532 .895 .811 .800 .628
Tag TF-IUF .661 .555 .892 .803 .794 .623
Distributions
on STS

User Behavior
on STS
IWFs underperform on Delicious, due to tag
suggestions that make top tags utmost popular.
Conclusions &
Outlook IUF superior to IBF and IRF. Users who make their
Publications own choices make the diﬀerence.



PhD Thesis Using Classifier Committees with IWFs
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier

STS &
Datasets

Representing
the How about using tags represented with IWFs on classifier
Aggregation of
Tags committees?
Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Results Using IWF with Committees
Arkaitz
Zubiaga

Motivation

Selection of a Delicious LThing GReads
Classiﬁer

STS &
L1 L2 L1 L2 L1 L2
Datasets TF .699 .588 .859 .755 .857 .730
Representing
the
TF-IRF .697 .592 .885 .793 .864 .748
IWFs

Aggregation of
Tags
TF-IBF .698 .592 .887 .797 .866 .751
Tag
TF-IUF .700 .595 .885 .792 .864 .749
Distributions
on STS

User Behavior IWF-based committes are even better than TF-based
ones.
on STS

Conclusions &
Outlook Even on Delicious, where IWFs were not appropriate,
Publications committees perform slightly better.



PhD Thesis Results Using IWF with Committees
Arkaitz
Zubiaga

Motivation

Selection of a
STS & L1 L2 L1 L2 L1 L2
TF
Datasets
.699 .588 .859 .755 .857 .730
Representing
the TF-IRF .697 .592 .885 .793 .864 .748
IWFs

Aggregation of
Tags TF-IBF .698 .592 .887 .797 .866 .751
Tag TF-IUF .700 .595 .885 .792 .864 .749
Distributions
on STS

User Behavior
on STS Despite this outperformance of IWFs using committees,
Conclusions & IWFs on their own perform better on LibraryThing
(.895 & .811).
Outlook

Publications



PhD Thesis Summary of Results Using IWFs
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS & IWFs are an appropriate way to weight tags when used
on classiﬁer committees.
Datasets

Representing
the The exception is LibraryThing, where tags on their own
Aggregation of
Tags perform better.
Tag
Distributions
on STS Combined data sources must be appropriately chosen
User Behavior (e.g., synopses & ed. reviews are harmful with books).
on STS

Conclusions &
Outlook

Publications


User Behavior on STS

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications



PhD Thesis User Behavior: Categorizers and Describers
Arkaitz
Zubiaga

K¨rner1 suggested 2 kinds of user behavior:
Motivation
o
Selection of a
Classifier

STS & Categorizer Describer
Datasets Goal of Tagging later browsing later retrieval
Representing
the
Change of Tag Vocabulary costly cheap
Aggregation of Size of Tag Vocabulary limited open
Tags
Tags subjective objective
Tag
Distributions
on STS

User Behavior They found that Describers help infer semantic
on STS
relations among tags.
Conclusions &
Outlook Do these tagging behaviors affect the usefulness of tags
Publications for resource classification?

1
C. K¨rner. Understanding the Motivation behind Tagging. Hypertext 2009.
o


PhD Thesis Categorizers and Describers
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Weighting Measures
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing We use 3 measures to weight users, based on Koerner et al.
the
Aggregation of
(2010).
Tags

Tag
2 factors are considered: verbosity & diversity.
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Weighting Measures: TPP
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
Tags per Post (TPP) – Verbosity
Representing
the
Aggregation of r
Tags
|Tur |
Tag TPP(u) =
Distributions |Ru |
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Weighting Measures: ORPHAN
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
Orphan Ratio (ORPHAN) – Diversity
Representing
the |R(tmax )|
Aggregation of n=
Tags 100
Tag
Distributions o
|Tu | o
on STS
ORPHAN(u) = , T = {t||R(t)| ≤ n}
User Behavior |Tu | u
on STS

Conclusions &
Outlook

Publications



PhD Thesis Weighting Measures: TRR
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing
Tag Resource Ratio (TRR) – Verbosity + Diversity
the
Aggregation of
Tags |Tu |
TRR(u) =
Tag |Ru |
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Use of Weighting Measures
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer These 3 measures provide:
STS &
Datasets
A weight for each user.
Ranking of users according to each measure.
Representing
the
Aggregation of
Tags From rankings → subsets of users as extreme
Tag
Distributions
Categorizers (highest-ranked) and extreme Describers
on STS (lowest-ranked).
User Behavior
on STS

Conclusions & Subsets range from 10% to 100% (step size = 10%).
Outlook

Publications



Arkaitz
Zubiaga
We select subsets of users according to number of tag
Motivation assignments.
Selection of a
Classiﬁer

STS &
Selecting by percents of users would be unfair →
Datasets diﬀerent amounts of data.
Representing
the
Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Experiments
Arkaitz
Zubiaga

Motivation
Classiﬁcation
Selection of a We use a multiclass SVM, with TF weighting of tags.
Classiﬁer

STS &
Descriptivity
Datasets
Vectorial representations of resources:
Tr → tag frequencies.
Representing
the
Aggregation of
Tags
Rr → term frequencies on descriptive data
Tag
(self-content).
Distributions
on STS

User Behavior Cosine similarity between Tr and Rr :
on STS

Conclusions & n
Outlook Tri × Rri
cos(θr ) =
n 2 n 2
Publications
i=1 i=1 (Tri ) × i=1 (Rri )



PhD Thesis Descriptivity Results
Arkaitz
Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.)
Motivation

Selection of a

Delicious
Classiﬁer

STS &
Datasets

Representing
the
LibraryThing

Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
GoodReads

Outlook

Publications



PhD Thesis Classiﬁcation Results
Arkaitz
Motivation

Delicious
Selection of a
Classiﬁer

STS &
Datasets

Representing
the
LibraryThing

Aggregation of
Tags

Tag
Distributions
on STS

User Behavior
on STS

Conclusions &
GoodReads

Outlook

Publications



PhD Thesis Overall Categorizers/Describers Results
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier

STS &

Discriminating by verbosity (TPP) does best for finding
Datasets

Representing
the extreme Categorizers.
Aggregation of
Tags

Tag
Distributions
The use of non-descriptive tags provide more accurate
on STS classification.
User Behavior
on STS

Conclusions &
Outlook

Publications


Conclusions & Outlook

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications



PhD Thesis Contributions
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets Generation & analysis of 3 large-scale social tagging
Representing datasets.
the
Aggregation of

Release of some tagging datasets, used by Godoy and
Tags

Tag
Distributions Amandi (2010), Strohmaier et al. (2010), Li et al. (2011),
on STS
and Ares et al. (2011).
User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Contributions
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier
First research work performing actual classification
STS &
Datasets experiments using social tags.
Representing Analysis of different representations of social tags.
the
Aggregation of Analysis of effect of tag distributions.
Tags
Study of user behavior.
Tag
Distributions

It paves the way to future researchers interested in the
on STS

User Behavior
on STS task & in the exploration of STS.
Conclusions &
Outlook

Publications



PhD Thesis Research Questions
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

Apart from the Problem Statement:
STS &
Datasets

Representing How can the annotations provided by users on social
the
Aggregation of tagging systems be exploited to improve the accuracy of
Tags a resource classiﬁcation task?
Tag
Distributions
on STS
We set forth 10 research questions.
User Behavior
on STS

Conclusions &
Outlook

Publications



PhD Thesis Research Questions (1)
Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets

Representing RQ 1 What is a suitable SVM classiﬁer for the task?
the
Aggregation of
Tags
Native multiclass SVM >> Combinations of
Tag
Distributions binary SVMs.
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS & RQ 2 What is a suitable learning method for the
Datasets
task?
Representing
the
Aggregation of
Supervised Semi-supervised.
Tags

Tag
Distributions Unlike for binary tasks, where
on STS
Semi-supervised >> Supervised (Joachims,
User Behavior
on STS 1999).
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
RQ 3 How do the settings of STS aﬀect
Representing
folksonomies?
the
Aggregation of
Tags Great impact of tag suggestions.
Tag
Distributions
on STS
Importance of encouraging users to
User Behavior
on STS annotate.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
RQ 4 How to amalgamate annotations to get a
Datasets representation of a resource?
Representing
the
Aggregation of
Tags
Considering all the tags rather than only
Tag
those in the top.
Distributions
on STS

User Behavior Weighting tags according to number of
on STS
users annotating them.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS & RQ 5 Is it worthwhile combining tags with other
Datasets
data sources?
Representing
the
Aggregation of
Tags
Combining diﬀerent data sources helps
Tag
improve performance.
Distributions
on STS

User Behavior Data sources must be appropriately
on STS
chosen.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
RQ 6 Are social tags speciﬁc enough to classify into
Representing
narrower categories?
the
Aggregation of
Tags Tags are as useful as for top level.
Tag
Distributions
on STS Noll and Meinel (2008) → tags were
User Behavior
on STS
probably not useful for deeper levels.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer
RQ 7 Can we consider tag distributions to get the
STS &
Datasets representativity of each tag?
Representing
the
Aggregation of LibraryThing & GoodReads: really useful.
Tags

Tag
Distributions Delicious: not useful, because of tag
suggestions
on STS

User Behavior
on STS → need of committees to make them
Conclusions & useful.
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
RQ 8 What approach to use to weigh the
Datasets representativity of tags?
Representing
the
Aggregation of
Tags
LibraryThing & GoodReads: IBF, IRF &
Tag
IUF are very similar.
Distributions
on STS

User Behavior Delicious: IUF clearly superior, because of
on STS
users that get rid of suggestions.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classifier

STS &
RQ 9 Can we discriminate users who further resemble
Datasets an expert classification?
Representing
the
Aggregation of
Tags
Categorizers > Describers for
Tag
classification.
Distributions
on STS

User Behavior Need of appropriate measure for
on STS
discriminating.
Conclusions &
Outlook

Publications



Arkaitz
Zubiaga

Motivation

Selection of a
Classiﬁer

STS &
Datasets
RQ 10 What features identify a Categorizer?
Representing
the
Aggregation of
Categorizers can be found when
Tags discriminating by verbosity.
Tag
Distributions
on STS
Non-descriptive tags produce more
User Behavior
on STS accurate classiﬁcation.
Conclusions &
Outlook

Publications



PhD Thesis Future Directions
Arkaitz
Zubiaga

Motivation

Selection of a
Classifier
Increase of interest in the field, still much work to do.
STS &
Datasets
We have considered each tag as a diferent token.
→ Considering semantic meanings of social tags could
Representing
the
Aggregation of
Tags help.
Tag
Distributions
on STS Tag suggestions leverage several issues in folksonomies.
User Behavior → Looking for a weighting function that fits the
on STS

Conclusions &
characteristics of systems with tag suggestions, e.g.,
Outlook Delicious.
Publications


Publications

PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation

Selection of a
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
Tag
on STS

User Behavior
Conclusions &
Outlook
Publications

8 Publications


Publications

PhD Thesis Publications
Arkaitz
Zubiaga

Motivation
Peer-Reviewed Conferences (I)
Selection of a Arkaitz Zubiaga, Christian K¨rner, Markus Strohmaier.
o
Classifier
2011. Tags vs Shelves: From Social Tagging to Social
STS &
Datasets Classification. In Proceedings of Hypertext 2011, the
Representing 22nd ACM Conference on Hypertext and Hypermedia,
the
Aggregation of Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)
Tags

Tag
Distributions
on STS
Arkaitz Zubiaga, Raquel Mart´ ınez, V´
ıctor Fresno. 2009.
User Behavior
Getting the Most Out of Social Annotations for Web Page
on STS Classification. In Proceedings of DocEng 2009, the 9th
Conclusions &
Outlook
ACM Symposium on Document Engineering, pp. 74-83,
Publications Munich, Germany. (acceptance rate: 16/54, 29.6%)
[15 citations]


Publications

Arkaitz
Zubiaga

Motivation Peer-Reviewed Conferences (II)
Selection of a
Classiﬁer Arkaitz Zubiaga. 2009. Enhancing Navigation on
STS & Wikipedia with Social Tags. Wikimania 2009, Buenos
Datasets
Aires, Argentina.
Representing
the [6 citations]
Aggregation of
Tags

Tag Arkaitz Zubiaga, Alberto P. Garc´ ıa-Plaza, V´
ıctor Fresno,
Distributions
on STS Raquel Mart´ ınez. 2009. Content-based Clustering for Tag
User Behavior Cloud Visualization. In Proceedings of ASONAM 2009,
on STS

Conclusions &
International Conference on Advances in Social Networks
Outlook Analysis and Mining, pp. 316-319, Athens, Greece.
Publications [3 citations]


Publications

Arkaitz
Zubiaga
Journals (I)
Motivation
Arkaitz Zubiaga, Raquel Mart´
ınez, V´
Selection of a
Classifier Augmenting Web Page Classifiers with Social Annotations.
STS & Procesamiento del Lenguaje Natural. (acceptance rate:
Datasets 33/60, 55%)
Representing
the
Aggregation of
Tags
Arkaitz Zubiaga, Raquel Mart´
ınez, V´
Tag
Clasificaciń de P´ginas Web con Anotaciones Sociales.
o a
Distributions Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233.
on STS
(acceptance rate: 36/72, 50%)
User Behavior
on STS

Conclusions & Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
ınez. 2009.
Outlook
Comparativa de Aproximaciones a SVM Semisupervisado
Publications
Multiclase para Clasificaciń de P´ginas Web. Procesamiento
o a
del Lenguaje Natural, vol. 42, pp. 63-70.


Publications

Arkaitz
Zubiaga

Motivation

Selection of a
Classifier

STS &
Datasets Journals (II)
Representing
the Arkaitz Zubiaga, V´
ıctor Fresno, Raquel Mart´
ınez. Harnessing
Aggregation of
Tags
Folksonomies to Produce a Social Classification of Resources.
IEEE Transactions on Knowledge and Data Engineering.
Tag
Distributions (pending notification)
on STS

User Behavior
on STS

Conclusions &
Outlook

Publications


Publications

Arkaitz
Zubiaga

Motivation Book Chapters
Selection of a
Classifier
Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
ınez. 2011.
STS & Exploiting Social Annotations for Resource Classification.
Datasets
Social Network Mining, Analysis and Research
Representing
the Trends: Techniques and Applications. IGI Global.
Aggregation of
Tags Workshops
Tag
Distributions Arkaitz Zubiaga, V´
ıctor Fresno, Raquel Mart´ınez. 2009. Is
on STS
Unlabeled Data Suitable for Multiclass SVM-based Web
User Behavior
on STS Page Classification?. In Proceedings of the NAACL-HLT
Conclusions & 2009 Workshop on Semi-supervised Learning for
Natural Language Processing, pp. 28-36, Boulder, CO,
Outlook

Publications
United States.


Publications

PhD Thesis Thank You
Arkaitz
Zubiaga

Motivation

Selection of a
Achiu Arigato Danke Dhannvaad Dua Netjer en ek
Classiﬁer
Efcharisto Gracias Gr`cies Gratia Grazie
a
Guishepeli Hvala Kiitos K¨sz¨n¨m
o o o
STS &
Datasets

Representing

Merc´ Merci Mila esker Obrigado
the
Aggregation of
Tags
e
Tag
Distributions Shukran Tack Tak Takk T¨anan Tapadh leat
Shukriya
on STS

User Behavior
on STS
Tesekk¨r ederim Thank you Toda
u
Conclusions &
Outlook

Publications

http://thesis.zubiaga.org/


Harnessing Folksonomies for Resource Classification

Recommended

Recommended

More Related Content

Similar to Harnessing Folksonomies for Resource Classification

Similar to Harnessing Folksonomies for Resource Classification (20)

More from azubiaga

More from azubiaga (9)

Recently uploaded

Recently uploaded (20)

Harnessing Folksonomies for Resource Classification