Harnessing Folksonomies for Resource Classification
1. Harnessing Folksonomies for Resource
Classification
PhD Thesis
Arkaitz Zubiaga
UNED
July 12th, 2011
Advisors:
Raquel Mart´ınez Unanue
V´
ıctor Fresno Fern´ndez
a
2. PhD Thesis Table of Contents
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 2 / 98
3. Motivation
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 3 / 98
4. Motivation
PhD Thesis Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 4 / 98
5. Motivation
PhD Thesis Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 5 / 98
6. Motivation
PhD Thesis Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 6 / 98
7. Motivation
PhD Thesis Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Classifying resources is a common task.
STS &
Datasets
Web pages, books, movies, files,...
Representing
the
Aggregation of
Large collections of resources → expensive & effortful
Tags to classify manually.
Tag
Distributions
LoC reported an average cost of $94.58 for cataloging
on STS each book in 2002.
User Behavior
on STS
Conclusions &
Enormous costs and efforts → automatic classification.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 7 / 98
8. Motivation
PhD Thesis Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Representation of resources → self-content.
STS &
Datasets
Representing Use of self-content of resources presents some issues:
the
Aggregation of Not always representative enough.
Tags
Not always accessible (e.g., books).
Tag
Distributions
on STS
User Behavior
Social tags provided by users → alternative to solve the
on STS problem.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 8 / 98
9. Motivation
PhD Thesis Tagging
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
T1, T2, T3 = sets of tags.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 9 / 98
10. Motivation
PhD Thesis Social Tagging
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications Aggregation of user annotations → folksonomy.
Folksonomy: Folk (People) + Taxis (Classification) +
Nomos (Management).
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 10 / 98
11. Motivation
PhD Thesis Organization of Resources
Arkaitz
Zubiaga
Motivation
Selection of a User annotations → own organization of resources.
Classifier
STS &
Datasets A user’s tags
Representing
the Tag # Resources
Aggregation of
Tags research 82
Tag twitter 28
Distributions
on STS web2.0 35
User Behavior language 42
on STS
english 64
Conclusions &
Outlook ... ...
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 11 / 98
12. Motivation
PhD Thesis Example of Bookmarks
Arkaitz
Zubiaga
Motivation User Resource Tags
Selection of a 1 user1 flickr.com photo, web2.0, social
Classifier
2 user2 flickr.com photography, images
STS &
Datasets 3 user1 google.com searchengine
Representing 4 user3 twitter.com microblogging, twitter
the
Aggregation of
Tags
Tag Bookmark: (1) user ui ∈ U who annotates
(2) resource rj ∈ R being annotated
Distributions
on STS
User Behavior (3) tags Tij = {t1 , ..., tn } ∈ T utilized.
on STS
Conclusions &
Outlook
Publications
bij : ui × rj × Tij
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 12 / 98
13. Motivation
PhD Thesis Sum of Annotations
Arkaitz
Zubiaga
Motivation Top tags (79,681 users)
Selection of a
Classifier
Tag Rank Tag User Count
STS & 1 photos 22,712
Datasets
2 flickr 19,046
Representing
the 3 photography 15,968
Aggregation of
Tags 4 photo 15,225
Tag 5 sharing 10,648
Distributions
on STS 6 images 9,637
User Behavior 7 web2.0 9,528
on STS
Conclusions &
8 community 4,571
Outlook 9 social 3,798
Publications
10 pictures 3,115
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 13 / 98
14. Motivation
PhD Thesis Tag-based Resource Classification
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 14 / 98
15. Motivation
PhD Thesis Problem Statement
Arkaitz
Zubiaga
Motivation
How can the annotations provided by users on social tagging
Selection of a
systems be exploited to improve the accuracy of a resource
Classifier
classification task?
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 15 / 98
16. Motivation
PhD Thesis Related Work
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Social tags for information management:
Datasets
Search: Bao et al. (2007) & Heymann et al. (2008).
Representing
the
Recommender Systems: Shepitsen et al. (2008) & Li
Aggregation of
Tags
Tag et al. (2008).
Distributions
on STS
User Behavior
on STS
Enhanced Browsing: Smith (2008).
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 16 / 98
17. Motivation
PhD Thesis Related Work
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Classification: Noll and Meinel (2008) → statistical
STS &
Datasets
analysis of matches between tags & taxonomies.
Representing
Tags are useful for broad categorization.
the Not for narrower categorization.
Aggregation of
Tags
Tag
Distributions
Lack of further research with:
on STS Actual classification experiments.
User Behavior Other types of resources.
on STS
Different representations of social tags.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 17 / 98
18. Selection of a Classifier
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 18 / 98
19. Selection of a Classifier
PhD Thesis Characteristics of the task
Arkaitz
Zubiaga
Motivation We have:
Selection of a
Classifier
Large set of resources: some labeled + many unlabeled.
STS &
Multiclass taxonomy.
Datasets
Representing
the
Automated classifiers learn a model from labeled
Aggregation of resources.
Tags
Tag
This model is used to classify unlabeled resources
Distributions afterward.
on STS
User Behavior
on STS 2 learning settings:
Conclusions &
Outlook
Supervised: only labeled resources considered for learning.
Semi-supervised: unlabeled resources are also taken into
Publications
account.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 19 / 98
20. Selection of a Classifier
PhD Thesis Support Vector Machines (SVM)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions & Hyperplane that separates with largest margin.
Outlook
Publications
Use of kernels → redimensions the space.
Resource/Hyperplane margin → Classifier’s reliability.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 20 / 98
21. Selection of a Classifier
PhD Thesis Selection of a Classifier
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier SVMs solve binary problems by default.
STS &
Datasets
Representing To solve multiclass tasks:
the
Aggregation of Native multiclass classifier (mSVM).
Tags Combining binary classifiers:
Tag one-against-all (oaaSVM).
Distributions
on STS one-against-one (oaoSVM).
User Behavior
on STS
Conclusions &
Both supervised (s) and semi-supervised (ss).
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 21 / 98
22. Selection of a Classifier
PhD Thesis Experiment Settings
Arkaitz
Zubiaga
Motivation
Selection of a
3 benchmark datasets to analyze suitability of
Classifier classifiers:
STS &
Datasets
Representing Dataset # web pages # trainset # categories
BankSearch
the
Aggregation of 10,000 3,000 10
WebKB
Tags
4,518 1,000 6
Tag
Distributions Y! Science 788 100 6
on STS
User Behavior
on STS
We present accuracy to show performance.
Conclusions &
Outlook
Publications We perform 6 runs, and show the average accuracy.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 22 / 98
23. Selection of a Classifier
PhD Thesis Results
Arkaitz
Zubiaga
Motivation BankSearch WebKB Y! Science
Selection of a mSVM (s) .925 .810 .825
Classifier
mSVM (ss) .923 .778 .836
STS &
Datasets oaaSVM (s) .843 .776 .536
Representing
the
oaaSVM (ss) .842 .773 .565
Aggregation of
Tags
oaoSVM (s) .826 .775 .483
Tag
oaoSVM (ss) .811 .754 .514
Distributions
on STS
User Behavior Native multiclass classifier performs best, while
on STS
Conclusions &
supervised semi-supervised.
Outlook
Publications
We used the supervised approach, as it is
computationally less expensive.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 23 / 98
24. STS & Datasets
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 24 / 98
25. STS & Datasets
PhD Thesis Requirements
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets Selected STS should have:
Representing Large communities involved.
the
Aggregation of Public access to data.
Tags
Consolidated taxonomies as a ground truth.
Tag
Distributions
We chose Delicious, LibraryThing & GoodReads.
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 25 / 98
26. STS & Datasets
PhD Thesis Characteristics of STS
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 26 / 98
27. STS & Datasets
PhD Thesis Characteristics of STS
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 27 / 98
28. STS & Datasets
PhD Thesis Characteristics of STS
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 28 / 98
29. STS & Datasets
PhD Thesis Characteristics of STS
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 29 / 98
30. STS & Datasets
PhD Thesis Characteristics of STS
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & Delicious LibraryThing GoodReads
Datasets
Resources web documents books books
Representing
the Tag suggestions based on earlier no based on earlier
Aggregation of bookmarks on the tags utilized by
Tags
resource the user
Tag
Distributions Tag insertion space-separated comma-separated one by one text-
on STS box
User Behavior Saving a resource prompts user to prompts user to user needs to
on STS
add tags add tags at sec- click again to add
Conclusions & ond step tags
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 30 / 98
31. STS & Datasets
PhD Thesis Retrieval of Categorized Resources
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Retrieval of popular annotated resources, which were also
STS & categorized by experts.
Datasets
Representing
the
Aggregation of Top level (L1) Second level (L2)
Tags
Tag
Resources Classes Resources Classes
Distributions
on STS
Web ODP 12,616 17 12,286 243
DDC 27,299 10 27,040 99
User Behavior
Books
on STS
LCC 24,861 20 23,565 204
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 31 / 98
32. STS & Datasets
PhD Thesis Retrieval of Additional User Annotations
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Delicious: 300,571,231 bookmarks
STS &
→ 273,478,137 annotated (91.00%)
Datasets
LibraryThing: 44,612,784 bookmarks
Representing
the
Aggregation of
Tags → 22,343,427 annotated (50.08%)
Tag
Distributions
on STS GoodReads: 47,302,861 bookmarks
User Behavior → 9,323,539 annotated (19.71%)
on STS
Conclusions & Importance of system’s encouragement to tagging
Outlook
Publications
resources.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 32 / 98
33. STS & Datasets
PhD Thesis Tag Popularity on Resources
Arkaitz
Zubiaga
Motivation
Selection of a 1
Classifier
0.9
STS & Delicious
Datasets
Representing
0.8
LibraryThing
the
Aggregation of
0.7 GoodReads
Average usage
Tags
0.6
Tag
Distributions 0.5
on STS
User Behavior 0.4
on STS
0.3
Conclusions &
Outlook
0.2
Publications
0.1
0
Tag rank on resources
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 33 / 98
34. STS & Datasets
PhD Thesis Tag Novelty in Bookmarks by Rank
Arkaitz
Zubiaga
100%
Motivation
Selection of a
Classifier
90% Delicious
STS & 80% LibraryThing
Datasets
70%
Representing
the
Aggregation of
60%
% of novelty
Tags
50%
Tag
Distributions
40%
on STS
User Behavior 30%
on STS
Conclusions & 20%
Outlook
10%
Publications
0%
3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97
Bookmark rank
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 34 / 98
35. STS & Datasets
PhD Thesis Retrieval of Additional Data
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier URLs:
STS &
Datasets
Self-content, by crawling URLs.
Representing
User reviews (Delicious & StumbleUpon).
the
Aggregation of
Tags Books:
Tag
Distributions
Self-content (unavailable):
on STS Synopses (Barnes&Noble).
User Behavior Editorial reviews (Amazon).
on STS
User reviews (LibraryThing, GoodReads & Amazon).
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 35 / 98
36. STS & Datasets
PhD Thesis Summary of the Analysis of Datasets
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Few users annotate resources when the system does
not encourage to do it.
Representing
the
Aggregation of
Tags
Tag Resource-based tag suggestions → Repeated use of
Distributions
on STS popular tags.
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 36 / 98
37. Representing the Aggregation of Tags
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 37 / 98
38. Representing the Aggregation of Tags
PhD Thesis Representing Resources Using Tags
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets Different ways to aggregate user annotations on a
Representing vectorial representation.
the
Aggregation of
Tags
Tag
2 major factors to consider:
Distributions What tags to use?
on STS
How to weigh those tags?
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 38 / 98
39. Representing the Aggregation of Tags
PhD Thesis Representing Resources Using Tags
Arkaitz
Zubiaga
Use of all tags (FTA), or just top 10 tags for each
Motivation
resource.
Selection of a
Classifier 4 different weightings.
STS &
Datasets Example of a resource (100 users):
Representing t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1)
the
Aggregation of
Tags
Tag FTA
Top 10
Distributions
on STS
User Behavior t1 t2 t3 ... t9 t10 ... tn
on STS
Ranks 1 0.9 0.8 ... 0.2 0.1 ... 0
Conclusions &
Outlook Fractions 0.5 0.3 0.2 ... 0.02 0.01 ... 0.01
Publications Binary 1 1 1 ... 1 1 ... 1
TF 50 30 20 ... 2 1 ... 1
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 39 / 98
40. Representing the Aggregation of Tags
PhD Thesis Representing Resources Using Other Data
Arkaitz Sources
Zubiaga
Motivation
Selection of a
Classifier
STS & To represent resources using content and reviews:
Datasets
Representing
1 Removal of HTML tags.
the
Aggregation of
Tags 2 Removal of stopwords.
Tag
Distributions
on STS 3 Stem of remaining words.
User Behavior
on STS
Conclusions &
4 TF-IDF weighting of words.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 40 / 98
41. Representing the Aggregation of Tags
PhD Thesis Experiment Setup
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Multiclass SVMs.
STS &
Datasets
Representing
Show the average accuracy of 6 runs.
the
Aggregation of
Tags For clarity of presentation, we limit results to:
Tag
Distributions
LCC taxonomy for books.
on STS Training sets of 6,000 URLs (6,616 (L1)/6,286 (L2) for
User Behavior test).
Training sets of 18,000 books (8,861 (L1)/5,565 (L2) for
on STS
Conclusions &
Outlook test).
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 41 / 98
42. Representing the Aggregation of Tags
PhD Thesis Experiment Setup
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Compared Representations
STS & Self-content (baseline).
Datasets
Representing
the Reviews.
Aggregation of
Tags
Tag Tags:
Distributions
on STS Ranks (Top 10).
User Behavior Fractions (Top 10 & FTA).
on STS
Binary (Top 10 & FTA).
Conclusions &
Outlook TF (Top 10 & FTA).
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 42 / 98
43. Representing the Aggregation of Tags
PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga
Motivation
Delicious LThing GReads
Selection of a
L1 L2 L1 L2 L1 L2
Classifier
(17) (243) (20) (204) (20) (204)
STS &
Datasets Content .610 .470 .807 .673 .807 .673
Representing Reviews .646 .524 .828 .705 .828 .705
Ranks
the
Aggregation of .484 .360 .795 .511 .630 .405
Fractions (10)
Tags
.464 .349 .738 .411 .663 .427
Tag
Fractions (FTA) .461 .336 .712 .409 .654 .432
Tags
Distributions
Binary (10)
on STS
.531 .361 .770 .550 .623 .422
User Behavior
on STS Binary (FTA) .572 .529 .655 .606 .639 .481
Conclusions & TF (10) .654 .545 .855 .722 .713 .491
Outlook
Publications
TF (FTA) .680 .568 .857 .736 .731 .517
Usually, FTA > 10.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 43 / 98
44. Representing the Aggregation of Tags
PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga
Motivation
Delicious LThing GReads
Selection of a
L1 L2 L1 L2 L1 L2
Classifier (17) (243) (20) (204) (20) (204)
STS &
Datasets Content .610 .470 .807 .673 .807 .673
Representing Reviews .646 .524 .828 .705 .828 .705
Ranks
the
Aggregation of .484 .360 .795 .511 .630 .405
Tags
Fractions (10) .464 .349 .738 .411 .663 .427
Fractions (FTA)
Tag
.461 .336 .712 .409 .654 .432
Tags
Distributions
Binary (10)
on STS
.531 .361 .770 .550 .623 .422
User Behavior
on STS Binary (FTA) .572 .529 .655 .606 .639 .481
Conclusions & TF (10) .654 .545 .855 .722 .713 .491
Outlook
TF (FTA) .680 .568 .857 .736 .731 .517
Publications
TF (FTA) is the best approach for tags.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 44 / 98
45. Representing the Aggregation of Tags
PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Delicious LThing GReads
STS &
Datasets L1 L2 L1 L2 L1 L2
Representing (17) (243) (20) (204) (20) (204)
the
Aggregation of Content .610 .470 .807 .673 .807 .673
Tags
Reviews .646 .524 .828 .705 .828 .705
Tag
Distributions Tags .680 .568 .857 .736 .731 .517
on STS
User Behavior
on STS
Tags clearly outperform content and reviews on
Conclusions &
Outlook Delicious and LibraryThing.
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 45 / 98
46. Representing the Aggregation of Tags
PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Delicious LThing GReads
STS &
Datasets L1 L2 L1 L2 L1 L2
Representing (17) (243) (20) (204) (20) (204)
the
Aggregation of Content .610 .470 .807 .673 .807 .673
Tags
Reviews .646 .524 .828 .705 .828 .705
Tag
Distributions Tags .680 .568 .857 .736 .731 .517
on STS
User Behavior
on STS
GoodReads’ disencouragement to tagging makes it
Conclusions &
Outlook insufficient to outperform content and reviews.
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 46 / 98
47. Representing the Aggregation of Tags
PhD Thesis Results of Tags vs Other Data Sources
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Delicious LThing GReads
Datasets L1 L2 L1 L2 L1 L2
Representing
the
(17) (243) (20) (204) (20) (204)
Aggregation of
Tags
Content .610 .470 .807 .673 .807 .673
Tag Reviews .646 .524 .828 .705 .828 .705
Distributions
on STS Tags .680 .568 .857 .736 .731 .517
User Behavior
on STS
Conclusions &
Tags are also useful for deeper categorization (L2).
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 47 / 98
48. Representing the Aggregation of Tags
PhD Thesis Classifier Committees
Arkaitz
Zubiaga
Despite the superiority of social tags, all data sources
Motivation perform well.
Selection of a
Classifier
STS &
Their outputs can be combined by using classifier
Datasets committees.
Representing
the
Aggregation of
Tags
Classifier committees add up margins (i.e., reliability
Tag values) outputted by several classifiers, and provide a
Distributions
on STS single combined prediction.
User Behavior
on STS
Conclusions &
Cat. #1 Cat. #2 Cat. #3
Outlook
Classif. A 1.2 1.1 0.6
Publications
Classif. B 0.5 1.0 1.2
Classif. committees 1.7 2.1 1.8
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 48 / 98
49. Representing the Aggregation of Tags
PhD Thesis Results of Classifier Committees
Arkaitz
Zubiaga
Delicious LThing GReads
Motivation
L1 L2 L1 L2 L1 L2
Selection of a
Classifier (17) (243) (20) (204) (20) (204)
STS & Content (C) .610 .470 .807 .673 .807 .673
Datasets
Reviews (R) .646 .524 .828 .705 .828 .705
Representing
the Tags (T) .680 .568 .857 .736 .731 .517
Aggregation of
C+R .670 .547 .817 .704 .817 .704
Commit.
Tags
Tag
Distributions
C+T .696 .587 .821 .720 .832 .696
on STS R+T .694 .584 .859 .755 .857 .730
User Behavior
on STS
C+R+T .699 .588 .827 .732 .843 .727
Conclusions &
Outlook
Classifier committees successfully improve performance.
Publications
Even on GoodReads, where tags were not good enough
on their own.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 49 / 98
50. Representing the Aggregation of Tags
PhD Thesis Results of Classifier Committees
Arkaitz
Zubiaga
Delicious LThing GReads
Motivation
L1 L2 L1 L2 L1 L2
Selection of a
Classifier (17) (243) (20) (204) (20) (204)
STS & Content (C) .610 .470 .807 .673 .807 .673
Reviews (R)
Datasets
.646 .524 .828 .705 .828 .705
Representing
the Tags (T) .680 .568 .857 .736 .731 .517
Aggregation of
C+R .670 .547 .817 .704 .817 .704
Commit.
Tags
Tag C+T .696 .587 .821 .720 .832 .696
Distributions
on STS R+T .694 .584 .859 .755 .857 .730
User Behavior C+R+T .699 .588 .827 .732 .843 .727
on STS
Conclusions &
Outlook
Data sources must be chosen with care:
Publications
All 3 are helpful on Delicious.
Content is harmful for books. Inappropriate considering
synopses and ed. reviews as a summary of content?
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 50 / 98
51. Representing the Aggregation of Tags
PhD Thesis Summary of Results
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Better represent using all tags with TF weighting.
STS &
Datasets
Representing Tags perform accurately even for deeper levels.
the
Aggregation of The system must encourage the user to tag to make it
Tags
useful enough.
Tag
Distributions
on STS
Tags can be combined with other data to improve
User Behavior
on STS performance.
Conclusions & Combined data sources must be chosen with care.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 51 / 98
52. Tag Distributions on STS
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 52 / 98
53. Tag Distributions on STS
PhD Thesis Tag Distributions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
So far, we have considered that tags annotated by the
STS &
Datasets
Representing same number of users are equally representative to
the
Aggregation of the resource.
Tags
Tag
Distributions
on STS
Distributions of tags in a collection could help
User Behavior
determine representativity of tags.
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 53 / 98
54. Tag Distributions on STS
PhD Thesis TF-IDF
Arkaitz
Zubiaga
Motivation
Selection of a TF-IDF is an inverse weighting function (IWF) that
Classifier
computes:
STS &
Datasets the term frequency (TF).
Representing
the the inverse document frequency (IDF).
Aggregation of
Tags
Tag
|D|
Distributions tf -idfij = tfij × log
on STS |{d : ti ∈ d}|
User Behavior
on STS High IDF value for terms appearing in few documents.
Low IDF value for terms appearing in many documents.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 54 / 98
55. Tag Distributions on STS
PhD Thesis Tag Weighting Functions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Analogous to TF-IDF on folksonomies:
STS &
Datasets TF-IRF → distributions across resources.
Representing
TF-IUF → distributions across users.
the TF-IBF → distributions across bookmarks.
Aggregation of
Tags
Tag
Distributions
TF-IRF and TF-IUF had been barely used, and their
on STS
suitability was yet unexplored.
User Behavior
on STS
Conclusions & TF-IBF had not been used.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 55 / 98
56. Tag Distributions on STS
PhD Thesis Results Using IWFs
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Delicious LThing GReads
STS & L1 L2 L1 L2 L1 L2
Datasets
TF .680 .568 .857 .736 .731 .517
Representing
the TF-IRF .639 .529 .894 .809 .799 .622
IWFs
Aggregation of
Tags TF-IBF .641 .532 .895 .811 .800 .628
Tag TF-IUF .661 .555 .892 .803 .794 .623
Distributions
on STS
All 3 IWFs clearly outperform TF for LibraryThing and
User Behavior
on STS
Conclusions & GoodReads.
Similar performance of IWFs.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 56 / 98
57. Tag Distributions on STS
PhD Thesis Results Using IWFs
Arkaitz
Zubiaga
Motivation
Selection of a Delicious LThing GReads
Classifier
STS &
L1 L2 L1 L2 L1 L2
Datasets TF .680 .568 .857 .736 .731 .517
Representing
the
TF-IRF .639 .529 .894 .809 .799 .622
IWFs
Aggregation of
Tags
TF-IBF .641 .532 .895 .811 .800 .628
Tag TF-IUF .661 .555 .892 .803 .794 .623
Distributions
on STS
User Behavior
on STS
IWFs underperform on Delicious, due to tag
suggestions that make top tags utmost popular.
Conclusions &
Outlook IUF superior to IBF and IRF. Users who make their
Publications own choices make the difference.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 57 / 98
58. Tag Distributions on STS
PhD Thesis Using Classifier Committees with IWFs
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the How about using tags represented with IWFs on classifier
Aggregation of
Tags committees?
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 58 / 98
59. Tag Distributions on STS
PhD Thesis Results Using IWF with Committees
Arkaitz
Zubiaga
Motivation
Selection of a Delicious LThing GReads
Classifier
STS &
L1 L2 L1 L2 L1 L2
Datasets TF .699 .588 .859 .755 .857 .730
Representing
the
TF-IRF .697 .592 .885 .793 .864 .748
IWFs
Aggregation of
Tags
TF-IBF .698 .592 .887 .797 .866 .751
Tag
TF-IUF .700 .595 .885 .792 .864 .749
Distributions
on STS
User Behavior IWF-based committes are even better than TF-based
ones.
on STS
Conclusions &
Outlook Even on Delicious, where IWFs were not appropriate,
Publications committees perform slightly better.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 59 / 98
60. Tag Distributions on STS
PhD Thesis Results Using IWF with Committees
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier Delicious LThing GReads
STS & L1 L2 L1 L2 L1 L2
TF
Datasets
.699 .588 .859 .755 .857 .730
Representing
the TF-IRF .697 .592 .885 .793 .864 .748
IWFs
Aggregation of
Tags TF-IBF .698 .592 .887 .797 .866 .751
Tag TF-IUF .700 .595 .885 .792 .864 .749
Distributions
on STS
User Behavior
on STS Despite this outperformance of IWFs using committees,
Conclusions & IWFs on their own perform better on LibraryThing
(.895 & .811).
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 60 / 98
61. Tag Distributions on STS
PhD Thesis Summary of Results Using IWFs
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & IWFs are an appropriate way to weight tags when used
on classifier committees.
Datasets
Representing
the The exception is LibraryThing, where tags on their own
Aggregation of
Tags perform better.
Tag
Distributions
on STS Combined data sources must be appropriately chosen
User Behavior (e.g., synopses & ed. reviews are harmful with books).
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 61 / 98
62. User Behavior on STS
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 62 / 98
63. User Behavior on STS
PhD Thesis User Behavior: Categorizers and Describers
Arkaitz
Zubiaga
K¨rner1 suggested 2 kinds of user behavior:
Motivation
o
Selection of a
Classifier
STS & Categorizer Describer
Datasets Goal of Tagging later browsing later retrieval
Representing
the
Change of Tag Vocabulary costly cheap
Aggregation of Size of Tag Vocabulary limited open
Tags
Tags subjective objective
Tag
Distributions
on STS
User Behavior They found that Describers help infer semantic
on STS
relations among tags.
Conclusions &
Outlook Do these tagging behaviors affect the usefulness of tags
Publications for resource classification?
1
C. K¨rner. Understanding the Motivation behind Tagging. Hypertext 2009.
o
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 63 / 98
64. User Behavior on STS
PhD Thesis Categorizers and Describers
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 64 / 98
65. User Behavior on STS
PhD Thesis Categorizers and Describers
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 65 / 98
66. User Behavior on STS
PhD Thesis Weighting Measures
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing We use 3 measures to weight users, based on Koerner et al.
the
Aggregation of
(2010).
Tags
Tag
2 factors are considered: verbosity & diversity.
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 66 / 98
67. User Behavior on STS
PhD Thesis Weighting Measures: TPP
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Tags per Post (TPP) – Verbosity
Representing
the
Aggregation of r
Tags
|Tur |
Tag TPP(u) =
Distributions |Ru |
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 67 / 98
68. User Behavior on STS
PhD Thesis Weighting Measures: ORPHAN
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Orphan Ratio (ORPHAN) – Diversity
Representing
the |R(tmax )|
Aggregation of n=
Tags 100
Tag
Distributions o
|Tu | o
on STS
ORPHAN(u) = , T = {t||R(t)| ≤ n}
User Behavior |Tu | u
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 68 / 98
69. User Behavior on STS
PhD Thesis Weighting Measures: TRR
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing
Tag Resource Ratio (TRR) – Verbosity + Diversity
the
Aggregation of
Tags |Tu |
TRR(u) =
Tag |Ru |
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 69 / 98
70. User Behavior on STS
PhD Thesis Use of Weighting Measures
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier These 3 measures provide:
STS &
Datasets
A weight for each user.
Ranking of users according to each measure.
Representing
the
Aggregation of
Tags From rankings → subsets of users as extreme
Tag
Distributions
Categorizers (highest-ranked) and extreme Describers
on STS (lowest-ranked).
User Behavior
on STS
Conclusions & Subsets range from 10% to 100% (step size = 10%).
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 70 / 98
71. User Behavior on STS
PhD Thesis Experiment Setup
Arkaitz
Zubiaga
We select subsets of users according to number of tag
Motivation assignments.
Selection of a
Classifier
STS &
Selecting by percents of users would be unfair →
Datasets different amounts of data.
Representing
the
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 71 / 98
72. User Behavior on STS
PhD Thesis Experiments
Arkaitz
Zubiaga
Motivation
Classification
Selection of a We use a multiclass SVM, with TF weighting of tags.
Classifier
STS &
Descriptivity
Datasets
Vectorial representations of resources:
Tr → tag frequencies.
Representing
the
Aggregation of
Tags
Rr → term frequencies on descriptive data
Tag
(self-content).
Distributions
on STS
User Behavior Cosine similarity between Tr and Rr :
on STS
Conclusions & n
Outlook Tri × Rri
cos(θr ) =
n 2 n 2
Publications
i=1 i=1 (Tri ) × i=1 (Rri )
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 72 / 98
73. User Behavior on STS
PhD Thesis Descriptivity Results
Arkaitz
Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.)
Motivation
Selection of a
Delicious
Classifier
STS &
Datasets
Representing
the
LibraryThing
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
GoodReads
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 73 / 98
74. User Behavior on STS
PhD Thesis Classification Results
Arkaitz
Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.)
Motivation
Delicious
Selection of a
Classifier
STS &
Datasets
Representing
the
LibraryThing
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
GoodReads
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 74 / 98
75. User Behavior on STS
PhD Thesis Classification Results
Arkaitz
Zubiaga TPP (Verb.) ORPHAN (Div.) TRR (V. + D.)
Motivation
Delicious
Selection of a
Classifier
STS &
Datasets
Representing
the
LibraryThing
Aggregation of
Tags
Tag
Distributions
on STS
User Behavior
on STS
Conclusions &
GoodReads
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 75 / 98
76. User Behavior on STS
PhD Thesis Overall Categorizers/Describers Results
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Discriminating by verbosity (TPP) does best for finding
Datasets
Representing
the extreme Categorizers.
Aggregation of
Tags
Tag
Distributions
The use of non-descriptive tags provide more accurate
on STS classification.
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 76 / 98
77. Conclusions & Outlook
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 77 / 98
78. Conclusions & Outlook
PhD Thesis Contributions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets Generation & analysis of 3 large-scale social tagging
Representing datasets.
the
Aggregation of
Release of some tagging datasets, used by Godoy and
Tags
Tag
Distributions Amandi (2010), Strohmaier et al. (2010), Li et al. (2011),
on STS
and Ares et al. (2011).
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 78 / 98
79. Conclusions & Outlook
PhD Thesis Contributions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
First research work performing actual classification
STS &
Datasets experiments using social tags.
Representing Analysis of different representations of social tags.
the
Aggregation of Analysis of effect of tag distributions.
Tags
Study of user behavior.
Tag
Distributions
It paves the way to future researchers interested in the
on STS
User Behavior
on STS task & in the exploration of STS.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 79 / 98
80. Conclusions & Outlook
PhD Thesis Research Questions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Apart from the Problem Statement:
STS &
Datasets
Representing How can the annotations provided by users on social
the
Aggregation of tagging systems be exploited to improve the accuracy of
Tags a resource classification task?
Tag
Distributions
on STS
We set forth 10 research questions.
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 80 / 98
81. Conclusions & Outlook
PhD Thesis Research Questions (1)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
Representing RQ 1 What is a suitable SVM classifier for the task?
the
Aggregation of
Tags
Native multiclass SVM >> Combinations of
Tag
Distributions binary SVMs.
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 81 / 98
82. Conclusions & Outlook
PhD Thesis Research Questions (2)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & RQ 2 What is a suitable learning method for the
Datasets
task?
Representing
the
Aggregation of
Supervised Semi-supervised.
Tags
Tag
Distributions Unlike for binary tasks, where
on STS
Semi-supervised >> Supervised (Joachims,
User Behavior
on STS 1999).
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 82 / 98
83. Conclusions & Outlook
PhD Thesis Research Questions (3)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
RQ 3 How do the settings of STS affect
Representing
folksonomies?
the
Aggregation of
Tags Great impact of tag suggestions.
Tag
Distributions
on STS
Importance of encouraging users to
User Behavior
on STS annotate.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 83 / 98
84. Conclusions & Outlook
PhD Thesis Research Questions (4)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
RQ 4 How to amalgamate annotations to get a
Datasets representation of a resource?
Representing
the
Aggregation of
Tags
Considering all the tags rather than only
Tag
those in the top.
Distributions
on STS
User Behavior Weighting tags according to number of
on STS
users annotating them.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 84 / 98
85. Conclusions & Outlook
PhD Thesis Research Questions (5)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS & RQ 5 Is it worthwhile combining tags with other
Datasets
data sources?
Representing
the
Aggregation of
Tags
Combining different data sources helps
Tag
improve performance.
Distributions
on STS
User Behavior Data sources must be appropriately
on STS
chosen.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 85 / 98
86. Conclusions & Outlook
PhD Thesis Research Questions (6)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
RQ 6 Are social tags specific enough to classify into
Representing
narrower categories?
the
Aggregation of
Tags Tags are as useful as for top level.
Tag
Distributions
on STS Noll and Meinel (2008) → tags were
User Behavior
on STS
probably not useful for deeper levels.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 86 / 98
87. Conclusions & Outlook
PhD Thesis Research Questions (7)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
RQ 7 Can we consider tag distributions to get the
STS &
Datasets representativity of each tag?
Representing
the
Aggregation of LibraryThing & GoodReads: really useful.
Tags
Tag
Distributions Delicious: not useful, because of tag
suggestions
on STS
User Behavior
on STS → need of committees to make them
Conclusions & useful.
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 87 / 98
88. Conclusions & Outlook
PhD Thesis Research Questions (8)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
RQ 8 What approach to use to weigh the
Datasets representativity of tags?
Representing
the
Aggregation of
Tags
LibraryThing & GoodReads: IBF, IRF &
Tag
IUF are very similar.
Distributions
on STS
User Behavior Delicious: IUF clearly superior, because of
on STS
users that get rid of suggestions.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 88 / 98
89. Conclusions & Outlook
PhD Thesis Research Questions (9)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
RQ 9 Can we discriminate users who further resemble
Datasets an expert classification?
Representing
the
Aggregation of
Tags
Categorizers > Describers for
Tag
classification.
Distributions
on STS
User Behavior Need of appropriate measure for
on STS
discriminating.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 89 / 98
90. Conclusions & Outlook
PhD Thesis Research Questions (10)
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets
RQ 10 What features identify a Categorizer?
Representing
the
Aggregation of
Categorizers can be found when
Tags discriminating by verbosity.
Tag
Distributions
on STS
Non-descriptive tags produce more
User Behavior
on STS accurate classification.
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 90 / 98
91. Conclusions & Outlook
PhD Thesis Future Directions
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
Increase of interest in the field, still much work to do.
STS &
Datasets
We have considered each tag as a diferent token.
→ Considering semantic meanings of social tags could
Representing
the
Aggregation of
Tags help.
Tag
Distributions
on STS Tag suggestions leverage several issues in folksonomies.
User Behavior → Looking for a weighting function that fits the
on STS
Conclusions &
characteristics of systems with tag suggestions, e.g.,
Outlook Delicious.
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 91 / 98
92. Publications
PhD Thesis Index
Arkaitz
Zubiaga
1 Motivation
Motivation
Selection of a
Classifier 2 Selection of a Classifier
STS &
Datasets
3 STS & Datasets
Representing
the
Aggregation of
Tags
4 Representing the Aggregation of Tags
Tag
Distributions 5 Tag Distributions on STS
on STS
User Behavior
on STS 6 User Behavior on STS
Conclusions &
Outlook
7 Conclusions & Outlook
Publications
8 Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 92 / 98
93. Publications
PhD Thesis Publications
Arkaitz
Zubiaga
Motivation
Peer-Reviewed Conferences (I)
Selection of a Arkaitz Zubiaga, Christian K¨rner, Markus Strohmaier.
o
Classifier
2011. Tags vs Shelves: From Social Tagging to Social
STS &
Datasets Classification. In Proceedings of Hypertext 2011, the
Representing 22nd ACM Conference on Hypertext and Hypermedia,
the
Aggregation of Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)
Tags
Tag
Distributions
on STS
Arkaitz Zubiaga, Raquel Mart´ ınez, V´
ıctor Fresno. 2009.
User Behavior
Getting the Most Out of Social Annotations for Web Page
on STS Classification. In Proceedings of DocEng 2009, the 9th
Conclusions &
Outlook
ACM Symposium on Document Engineering, pp. 74-83,
Publications Munich, Germany. (acceptance rate: 16/54, 29.6%)
[15 citations]
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 93 / 98
94. Publications
PhD Thesis Publications
Arkaitz
Zubiaga
Motivation Peer-Reviewed Conferences (II)
Selection of a
Classifier Arkaitz Zubiaga. 2009. Enhancing Navigation on
STS & Wikipedia with Social Tags. Wikimania 2009, Buenos
Datasets
Aires, Argentina.
Representing
the [6 citations]
Aggregation of
Tags
Tag Arkaitz Zubiaga, Alberto P. Garc´ ıa-Plaza, V´
ıctor Fresno,
Distributions
on STS Raquel Mart´ ınez. 2009. Content-based Clustering for Tag
User Behavior Cloud Visualization. In Proceedings of ASONAM 2009,
on STS
Conclusions &
International Conference on Advances in Social Networks
Outlook Analysis and Mining, pp. 316-319, Athens, Greece.
Publications [3 citations]
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 94 / 98
95. Publications
PhD Thesis Publications
Arkaitz
Zubiaga
Journals (I)
Motivation
Arkaitz Zubiaga, Raquel Mart´
ınez, V´
ıctor Fresno. 2011.
Selection of a
Classifier Augmenting Web Page Classifiers with Social Annotations.
STS & Procesamiento del Lenguaje Natural. (acceptance rate:
Datasets 33/60, 55%)
Representing
the
Aggregation of
Tags
Arkaitz Zubiaga, Raquel Mart´
ınez, V´
ıctor Fresno. 2009.
Tag
Clasificaci´n de P´ginas Web con Anotaciones Sociales.
o a
Distributions Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233.
on STS
(acceptance rate: 36/72, 50%)
User Behavior
on STS
Conclusions & Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
ınez. 2009.
Outlook
Comparativa de Aproximaciones a SVM Semisupervisado
Publications
Multiclase para Clasificaci´n de P´ginas Web. Procesamiento
o a
del Lenguaje Natural, vol. 42, pp. 63-70.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 95 / 98
96. Publications
PhD Thesis Publications
Arkaitz
Zubiaga
Motivation
Selection of a
Classifier
STS &
Datasets Journals (II)
Representing
the Arkaitz Zubiaga, V´
ıctor Fresno, Raquel Mart´
ınez. Harnessing
Aggregation of
Tags
Folksonomies to Produce a Social Classification of Resources.
IEEE Transactions on Knowledge and Data Engineering.
Tag
Distributions (pending notification)
on STS
User Behavior
on STS
Conclusions &
Outlook
Publications
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 96 / 98
97. Publications
PhD Thesis Publications
Arkaitz
Zubiaga
Motivation Book Chapters
Selection of a
Classifier
Arkaitz Zubiaga, V´ıctor Fresno, Raquel Mart´
ınez. 2011.
STS & Exploiting Social Annotations for Resource Classification.
Datasets
Social Network Mining, Analysis and Research
Representing
the Trends: Techniques and Applications. IGI Global.
Aggregation of
Tags Workshops
Tag
Distributions Arkaitz Zubiaga, V´
ıctor Fresno, Raquel Mart´ınez. 2009. Is
on STS
Unlabeled Data Suitable for Multiclass SVM-based Web
User Behavior
on STS Page Classification?. In Proceedings of the NAACL-HLT
Conclusions & 2009 Workshop on Semi-supervised Learning for
Natural Language Processing, pp. 28-36, Boulder, CO,
Outlook
Publications
United States.
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 97 / 98
98. Publications
PhD Thesis Thank You
Arkaitz
Zubiaga
Motivation
Selection of a
Achiu Arigato Danke Dhannvaad Dua Netjer en ek
Classifier
Efcharisto Gracias Gr`cies Gratia Grazie
a
Guishepeli Hvala Kiitos K¨sz¨n¨m
o o o
STS &
Datasets
Representing
Merc´ Merci Mila esker Obrigado
the
Aggregation of
Tags
e
Tag
Distributions Shukran Tack Tak Takk T¨anan Tapadh leat
Shukriya
on STS
User Behavior
on STS
Tesekk¨r ederim Thank you Toda
u
Conclusions &
Outlook
Publications
http://thesis.zubiaga.org/
Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 98 / 98