There are two principal data sources for collaborative-filtering recommenders in scholarly digital libraries: usage data harvested from a large, distributed collection of OpenURL web logs, and citation data obtained from the journal articles themselves.
This study explores the characteristics of recommendations generated by implementations of these two methods: the "bX" system by Ex Libris and an experimental citation-based recommender, Sarkanto. Recommendations from each system were compared according to their semantic similarity to the seed article used to generate them. Because the full text of the articles was not available for all the recommendations in both systems, the semantic similarity between a seed article and a recommended article was taken to be the semantic distance between the journals in which the articles were published. The semantic distance between journals was computed from the "semantic vectors" distance between all the terms in the full text of the available articles in each journal. The study shows that citation-based recommendations are more semantically diverse than usage-based ones.
The two recommenders are also complementary: most of the time, when one produces recommendations for a given seed article, the other does not.
Usage-Based vs. Citation-Based Recommenders in a Digital Library
1. Usage-Based vs. Citation-Based Recommenders in a Digital Library
André Vellino
School of Information Studies
University of Ottawa
blog: http://synthese.wordpress.com
twitter: @vellino
e-mail: avellino@uottawa.ca
2. Context
— Canada Institute for Scientific and Technical Information
(aka Canada’s National Science Library)
— Has a full-text digital collection (Scientific, Technical,
Medical) with text-mining rights for research purposes only
— Elsevier and Springer (mostly)
— ~8M articles
— ~2800 journals
— ~ 3TB
— Plan: a Hybrid, Multi-Dimensional recommender combining:
— Usage-based (CF)
— Content-based (CBF)
— User-Context
3. Sparsity of Usage Data is a Problem in Digital Libraries
— Amazon: ~70 M users, ~93 M items
— Digital libraries: ~70,000 users, ~7 M items
4. Data is Sparse Too
— Sparseness of a dataset: S = (edges in the user-item graph) / (total number of possible edges)
— Mendeley data: S = 2.66 × 10⁻⁵
— Netflix: S = 1.18 × 10⁻²
— But also, Mendeley data isn’t “highly connected”
— 83.6% of Mendeley articles were referenced by only 1 user
— 6% of the articles were referenced by 3 or more users.
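The sparseness measure above can be sketched as follows (toy numbers for illustration, not the actual Mendeley or Netflix counts):

```python
def sparseness(num_edges: int, num_users: int, num_items: int) -> float:
    """Sparseness S = edges in the user-item graph / total possible edges."""
    return num_edges / (num_users * num_items)

# Toy example: 50 user-item interactions among 100 users and 1,000 items.
S = sparseness(50, 100, 1_000)
print(f"S = {S:.2e}")  # S = 5.00e-04
```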
5. ExLibris bX (2009)
The ExLibris bX solution to data sparsity: harvest large volumes of usage (co-download) behaviour from world-wide SFX (the Ex Libris OpenURL resolver) logs and apply collaborative filtering to correlate articles.
Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In JCDL 2006.
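A minimal sketch of the co-download idea described above, assuming a hypothetical, already-parsed log format (one set of article IDs per user session); the actual bX pipeline is not described in this deck:

```python
from collections import defaultdict
from itertools import combinations

def co_download_counts(sessions):
    """Count how often two articles are downloaded in the same session.

    `sessions` is a list of sets of article IDs -- a simplified
    stand-in for events parsed out of SFX OpenURL resolver logs.
    """
    counts = defaultdict(int)
    for articles in sessions:
        for a, b in combinations(sorted(articles), 2):
            counts[(a, b)] += 1
    return counts

def recommend(article, counts, top_n=3):
    """Rank articles by how often they co-occur with `article`."""
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a == article:
            scores[b] += n
        elif b == article:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sessions = [{"p1", "p2"}, {"p1", "p2", "p3"}, {"p2", "p4"}]
counts = co_download_counts(sessions)
print(recommend("p1", counts))  # ['p2', 'p3']
```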
6. TechLens+ Citation-Based Recommendation
[Figure: citation graph linking articles (p2, p3, p5) via their references]
R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital Libraries with TechLens+. In JCDL 2004.
7. Do “Rated” Citations w/ PageRank Help?
[Figure: article–article and user–article rating matrices (rows p1–p4 and u1–u2, columns p1–p8) comparing PageRank-weighted citation entries against a constant weight]
Answer:
Using PageRank to “rate” citations is not significantly
better than using a constant (0/1) weight.
Note:
There is ongoing work with NRC on a machine-learning method
for extracting the “most important references” – that might help more.
8. Sarkanto (NRC Article Recommender)
— Uses TechLens+ strategy of replacing User-Item matrix with
Article-Article matrix from citation data
— Uses TASTE recommender (now the recommendation
component of Mahout)
— Is now decoupled from user-based recommender
— Compare side by side w/ ‘bX’ recommendations
Try it here:
http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
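A minimal sketch of the TechLens+ strategy Sarkanto adopts (toy citation data and function names are hypothetical; Sarkanto itself uses the Taste/Mahout recommender):

```python
from collections import defaultdict

# Toy citation data: each article maps to the set of articles it cites.
references = {
    "p1": {"p2", "p3"},
    "p4": {"p2", "p3", "p5"},
    "p6": {"p3", "p5"},
}

def article_article_matrix(references):
    """Invert the citation data: cited article -> set of citing articles.

    Each citing article plays the role of a 'user' and each cited
    article the role of an 'item', replacing the user-item matrix."""
    matrix = defaultdict(set)
    for citing, cited in references.items():
        for c in cited:
            matrix[c].add(citing)
    return matrix

def co_cited_with(article, references, matrix):
    """Articles ranked by how many citing articles they share with
    `article` (i.e. by co-citation count)."""
    scores = defaultdict(int)
    for citing in matrix.get(article, set()):
        for other in references[citing]:
            if other != article:
                scores[other] += 1
    return sorted(scores, key=scores.get, reverse=True)

matrix = article_article_matrix(references)
print(co_cited_with("p2", references, matrix))  # ['p3', 'p5']
```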
9. Sarkanto compared w/ bX
Sarkanto: “These are articles whose co-citations are similar to this one.”
bX: “Users who viewed this article also viewed these articles.”
10. Experiments
— Sarkanto generated ~ 1.9 million citation-based
recommendations (statically)
— Experimental comparison done on 1886 randomly selected
articles from a subset of ~ 1.2M articles (down from ~ 8M)
— Questions asked in the experiment:
— How many recommendations does each recommender produce?
— Coverage: how often does a seed article generate a recommendation?
— How semantically diverse are the recommendations?
11. Measuring Semantic Diversity
— Question: what is the semantic distance between the source article and the recommendations?
— In this setup it was not possible to compare semantic distances directly, since the full text was not available for both sets of recommendations
— Full text is available for the Sarkanto recommendations but not for the bX recommendations
12. Journal-Journal Semantic Distance
— Concatenate the full-text of all the articles in each journal
— From a Lucene index of the full text in each journal, use
Dominic Widdows’ Semantic Vectors package to create
— a term-journal matrix,
— reduced dimensionality term-vectors (512) for each journal
using random projections
— Apply multidimensional scaling (MDS) in R to reduce the journal vectors to 2-D, yielding a 2300 × 2300 journal-journal distance matrix
G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for search visualization in a large scale article digital library. In Second Workshop on Very Large Digital Libraries, ECDL 2009.
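The pipeline above can be sketched with numpy alone (a toy term-journal matrix and tiny dimensions stand in for the study's Lucene index, the Semantic Vectors package's 512-dimensional random projections, and MDS in R):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-journal count matrix: 6 terms (rows) x 4 journals (columns).
term_journal = np.array([
    [5, 4, 0, 0],
    [3, 5, 0, 1],
    [0, 1, 6, 5],
    [0, 0, 4, 6],
    [2, 2, 1, 1],
    [1, 0, 3, 2],
], dtype=float)

# Random projection: reduce the term dimension (512 in the study, 3 here).
projection = rng.standard_normal((3, term_journal.shape[0]))
journal_vectors = (projection @ term_journal).T   # one row per journal

# Cosine distances between journal vectors.
unit = journal_vectors / np.linalg.norm(journal_vectors, axis=1, keepdims=True)
distances = 1.0 - unit @ unit.T

# Classical MDS: embed the journal-journal distance matrix in 2-D.
n = distances.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (distances ** 2) @ J               # double centering
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]             # top 2 eigen-directions
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

print(coords.shape)  # (4, 2) -- one 2-D point per journal
```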
14. Results: Diversity of Recommendations
— ~13% of seed articles generated recommendations for both
bX and Sarkanto (i.e. not much overlap!)
— Citation-based recommendations appear to be more semantically diverse than usage-based ones.
15. Conclusions
— Citation-based and User-based recommendations are
complementary
— Different kinds of data sources (users vs. citations) produce
different kinds of (non-overlapping) results
— Citation-based recommendations are more semantically diverse
— Hypothesis: “user-based recommendations may be biased by the semantic similarity of search-engine results”