Usage-Based vs. Citation-Based
Recommenders in a Digital Library

André Vellino
School of Information Studies
University of Ottawa
blog: http://synthese.wordpress.com
twitter: @vellino
e-mail: avellino@uottawa.ca
Context
—  Canada Institute for Scientific and Technical Information
  (aka Canada’s National Science Library)
—  Has a full-text digital collection (Scientific, Technical,
  Medical) with text-mining rights for research purposes only
  —  Elsevier and Springer (mostly)
      —  ~8M articles
      —  ~2800 journals
      —  ~ 3TB
—  Plan: a Hybrid, Multi-Dimensional Recommender
  —  Usage-based (CF)
  —  Content-based (CBF)
  —  User-Context
Sparsity of Usage Data is a Problem in Digital Libraries
                     Users      Items
Amazon               ~70 M      ~93 M
Digital Libraries    ~70,000    ~7 M
Data is Sparse Too
—  Sparseness of a dataset: S = (number of edges in the user-item graph) / (total number of possible edges)
—  Mendeley data           S = 2.66 x 10^-5
—  Netflix                 S = 1.18 x 10^-2
—  But also, Mendeley data isn’t “highly connected”
   —  83.6% of Mendeley articles were referenced by only 1 user
   —  6% of the articles were referenced by 3 or more users
(2009)	
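The sparseness ratio above can be sketched in a few lines. This is a minimal illustration on invented toy data, assuming that "total number of possible edges" means the number of users multiplied by the number of items; the function name `sparseness` and the sample edges are hypothetical and are not taken from the Mendeley or Netflix datasets.

```python
def sparseness(num_edges: int, num_users: int, num_items: int) -> float:
    """Fraction of the possible user-item edges that are actually present."""
    possible = num_users * num_items  # assumption: every user could link to every item
    if possible == 0:
        raise ValueError("need at least one user and one item")
    return num_edges / possible


# Toy data: each pair is (user, item). Values are invented for illustration.
edges = {("u1", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")}
users = {user for user, _ in edges}
items = {item for _, item in edges}

S = sparseness(len(edges), len(users), len(items))
print(f"S = {S:.3f}")
```

A smaller S means a sparser dataset, which is why the Mendeley figure is the harder case for collaborative filtering than the Netflix one.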




Ex Libris bX solution to data sparsity:
   Harvest lots of usage (co-download)
   behaviour from world-wide SFX (Ex
   Libris OpenURL resolver) logs and
   apply collaborative filtering to
   correlate articles.




     Johan Bollen and Herbert Van de Sompel. An architecture for the
     aggregation and analysis of scholarly usage data. (in JCDL2006)
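As a rough sketch of the co-download idea, the snippet below counts how often two articles appear in the same session and recommends the most frequent companions. It is not the bX implementation: the session format, the function names, and the use of raw co-occurrence counts as the similarity measure are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_download_counts(sessions):
    """Count, for every pair of articles, the sessions in which both were downloaded."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Deduplicate within a session so repeated downloads count once.
        for a, b in combinations(sorted(set(session)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, article, n=3):
    """Return up to n articles most often co-downloaded with `article`."""
    ranked = sorted(counts[article].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]


# Hypothetical resolver sessions; the article identifiers are invented.
sessions = [["a1", "a2", "a3"], ["a1", "a2"], ["a2", "a4"], ["a1", "a3"]]
counts = co_download_counts(sessions)
top = recommend(counts, "a1")
print(top)
```

Aggregating logs from many institutions increases the number of sessions per article, which is how this approach works around sparsity at any single library.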
TechLens+ Citation-Based Recommendation
[Figure: citation graph linking articles (p2, p3, p5, ...) to their references]

        R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital
        Libraries with TechLens+. (in JCDL 2004)
Do “Rated” Citations w/ PageRank Help?
[Figure: rating matrix whose rows are articles (p1 to p4) and users (u1, u2) and whose columns are cited articles (p1 to p8); cells hold PageRank-derived weights such as 0.2 to 0.7 in place of a constant value]

Answer:
    Using PageRank to “rate” citations is not significantly
    better than using a constant (0/1)
Note:
    There is ongoing work w/ NRC on a machine learning method
    for extracting “most important references”, which might help more
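To make the comparison concrete, here is a small sketch of what "rating" citations with PageRank could look like: a plain power-iteration PageRank over a toy citation graph, followed by two versions of one article's row in the article-citation matrix. The graph, the damping factor, and the way the scores are placed in the matrix are assumptions for illustration; the slides do not specify these details.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank; `citations` maps an article to the articles it cites."""
    nodes = set(citations) | {c for cited in citations.values() for c in cited}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            cited = citations.get(node, [])
            if cited:
                share = damping * rank[node] / len(cited)
                for c in cited:
                    new_rank[c] += share
            else:
                # Dangling article (cites nothing): spread its rank evenly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy citation graph; identifiers are invented.
citations = {"p1": ["p3"], "p2": ["p3", "p4"], "p3": ["p4"], "p4": []}
rank = pagerank(citations)

# Two candidate rows for article p2 in the article-citation matrix.
constant_row = {c: 1.0 for c in citations["p2"]}      # 0/1 "rating"
weighted_row = {c: rank[c] for c in citations["p2"]}  # PageRank "rating"
print(constant_row, weighted_row)
```

Either row can then be fed to the same collaborative filtering step, which is what makes the two weighting schemes directly comparable.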
Sarkanto (NRC Article Recommender)
—  Uses TechLens+ strategy of replacing User-Item matrix with
    Article-Article matrix from citation data
—  Uses TASTE recommender (now the recommendation
    component of Mahout)
—  Is now decoupled from user-based recommender
—  Compare side by side w/ ‘bX’ recommendations
Try it here:

     http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
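The TechLens+ substitution can be sketched as follows: each article plays the role of a "user" whose "items" are the references it cites, and neighbouring articles are those with overlapping reference lists. This is only an illustration in plain Python, not the Taste/Mahout code Sarkanto uses; the Jaccard similarity, the scoring rule, and all identifiers are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of references."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_citations(references, seed, n=3):
    """Suggest references cited by articles whose reference lists resemble the seed's."""
    seed_refs = references[seed]
    scores = {}
    for other, other_refs in references.items():
        if other == seed:
            continue
        sim = jaccard(seed_refs, other_refs)
        if sim == 0.0:
            continue
        # Credit each reference the seed does not already cite.
        for cited in other_refs - seed_refs:
            if cited != seed:
                scores[cited] = scores.get(cited, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:n]]


# Toy article -> references map; identifiers are invented.
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r1", "r2", "r4"},
    "p3": {"r2", "r3", "r5"},
    "p4": {"r6"},
}
recs = recommend_from_citations(references, "p1")
print(recs)
```

Because the input is citation data rather than user behaviour, this kind of recommender needs no usage logs at all, which is what allows it to be decoupled from a user-based recommender.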
Sarkanto compared w/ bX




Sarkanto: “These are articles whose co-citations are similar to this one.”
bX: “Users who viewed this article also viewed these articles.”
Experiments
—  Sarkanto generated ~ 1.9 million citation-based
    recommendations (statically)
—  Experimental comparison done on 1886 randomly selected
    articles from a subset of ~ 1.2M articles (down from ~ 8M)
—  Questions asked in the experiment:
  —  How many recommendations produced by each recommender
  —  Coverage (how often does a seed article generate a
      recommendation)
  —  How semantically diverse are the recommendations
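The coverage question can be expressed as a short calculation. The sketch below assumes each recommender's output is available as a mapping from seed article to a list of recommended articles; that data layout, the function names, and the sample values are hypothetical.

```python
def coverage(recommendations, seeds):
    """Fraction of seed articles for which at least one recommendation exists."""
    covered = sum(1 for seed in seeds if recommendations.get(seed))
    return covered / len(seeds)


def covered_by_both(recs_a, recs_b, seeds):
    """Fraction of seed articles for which both recommenders return something."""
    both = sum(1 for seed in seeds if recs_a.get(seed) and recs_b.get(seed))
    return both / len(seeds)


# Invented sample: seed article -> recommended articles.
seeds = ["a1", "a2", "a3", "a4"]
citation_recs = {"a1": ["a5"], "a2": ["a6", "a7"], "a3": []}
usage_recs = {"a2": ["a8"], "a4": ["a9"]}

print(coverage(citation_recs, seeds))
print(coverage(usage_recs, seeds))
print(covered_by_both(citation_recs, usage_recs, seeds))
```

Counting recommendations per seed follows the same pattern, with `len(recommendations.get(seed, []))` in place of the truth test.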
Measuring Semantic Diversity




—  Question: what is the semantic distance between the source-
    article and the recommendations?
—  In this setup it was not possible to compare the semantic distance
    without the full text for both sets of recommendations
—  Full-text is available for the Sarkanto recommendations but not for
    the bX recommendations
Journal-Journal Semantic Distance
  —  Concatenate the full-text of all the articles in each journal
  —  From a Lucene index of the full text in each journal, use
     Dominic Widdows’ Semantic Vectors package to create
     —  a term-journal matrix,
     —  reduced dimensionality term-vectors (512) for each journal
        using random projections
  —  Apply multidimensional scaling (MDS) in R to the journal-journal
     distance matrix (2300 x 2300) to obtain a 2-D map
G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for
search visualization in a large scale article digital library. (in Second Workshop
on Very Large Digital Libraries, ECDL 2009)
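The pipeline above can be approximated end to end on toy data. The sketch below swaps in NumPy for the tools named on the slide: a Gaussian random projection stands in for the Semantic Vectors package, cosine distance is an assumed choice of journal-journal distance, and classical (Torgerson) MDS stands in for the R routine. The matrix sizes and all function names are invented for illustration.

```python
import numpy as np


def random_projection(term_journal, dim, seed=0):
    """Reduce each journal's term vector (one column per journal) to `dim` dimensions."""
    rng = np.random.default_rng(seed)
    n_terms = term_journal.shape[0]
    projection = rng.standard_normal((dim, n_terms)) / np.sqrt(dim)
    return (projection @ term_journal).T  # shape: (n_journals, dim)


def cosine_distances(vectors):
    """Pairwise cosine distance between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)


def classical_mds(distances, dims=2):
    """Classical MDS: embed points so Euclidean distances approximate `distances`."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]  # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy term-journal count matrix: 200 terms, 6 journals (values are random).
rng = np.random.default_rng(1)
term_journal = rng.poisson(1.0, size=(200, 6)).astype(float)

vectors = random_projection(term_journal, dim=16)
distances = cosine_distances(vectors)
coords = classical_mds(distances)
print(coords.shape)
```

At the scale described on the slide the same steps would produce 512-dimensional journal vectors, a 2300 x 2300 distance matrix, and one 2-D coordinate per journal.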
2-D Journal Distance Map
Coloured clusters represent journal subject headings (from publisher metadata)

http://cuvier.cisti.nrc.ca/~gnewton/torngat/applet.2009.07.22/index.html
Results: Diversity of Recommendations

—  ~13% of seed articles generated recommendations for both
    bX and Sarkanto (i.e. not much overlap!)
—  Citation-based recommendations appear to be more
    semantically diverse than User-based.
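One way to turn "semantic diversity" into a number is to average the map distance between the seed article's journal and the journals of its recommendations. The sketch below assumes exactly that measure and assumes 2-D journal coordinates are already available; the slides do not state the precise formula, and the coordinates and journal names here are invented.

```python
from math import dist
from statistics import mean


def mean_semantic_distance(seed_journal, recommended_journals, coords):
    """Average 2-D map distance from the seed's journal to each recommended journal."""
    return mean(dist(coords[seed_journal], coords[j]) for j in recommended_journals)


# Hypothetical journal coordinates on the 2-D map.
coords = {"J1": (0.0, 0.0), "J2": (3.0, 4.0), "J3": (0.0, 1.0)}

citation_based = mean_semantic_distance("J1", ["J2", "J3"], coords)
usage_based = mean_semantic_distance("J1", ["J3", "J3"], coords)
print(citation_based, usage_based)
```

Under this measure a larger average means the recommendations come from journals that sit farther from the seed's journal on the map, i.e. a more diverse set.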
Conclusions
—  Citation-based and User-based recommendations are
    complementary
—  Different kinds of data sources (users vs. citations) produce
    different kinds of (non-overlapping) results
—  Citation-based recommendations are more semantically diverse
    —  Hypothesis: “user-based recommendations may be biased by the semantic
        similarity of search-engine results”

 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Usage-Based vs. Citation-Based Recommenders in a Digital Library

  • 1. Usage-Based vs. Citation-Based Recommenders in a Digital Library
      André Vellino, School of Information Studies, University of Ottawa
      blog: http://synthese.wordpress.com; twitter: @vellino; e-mail: avellino@uottawa.ca
  • 2. Context
      - Canada Institute for Scientific and Technical Information (aka Canada’s National Science Library)
      - Has a full-text digital collection (Scientific, Technical, Medical) with text-mining rights for research purposes only
          - Elsevier and Springer (mostly)
              - ~8M articles
              - ~2800 journals
              - ~3 TB
      - Plan: a hybrid, multi-dimensional recommender
          - Usage-based (CF)
          - Content-based (CBF)
          - User-context
  • 3. Sparsity of Usage Data is a Problem in Digital Libraries
      - Amazon: ~70 M users, ~93 M items
      - Digital libraries: ~70,000 users, ~7 M items
  • 4. Data is Sparse Too
      - Sparseness of a dataset: S = (edges in the user-item graph) / (total number of possible edges)
      - Mendeley data: S = 2.66 × 10^-5
      - Netflix: S = 1.18 × 10^-2
      - But also, Mendeley data isn’t “highly connected”
          - 83.6% of Mendeley articles were referenced by only 1 user
          - 6% of the articles were referenced by 3 or more users
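The sparseness ratio on slide 4 can be sketched in a few lines. This is a minimal illustration, assuming the "total number of possible edges" is the number of users times the number of items; the function name and the toy data are hypothetical, not taken from the talk.

```python
def sparseness(edges):
    """Fraction of possible user-item edges that are actually present.

    edges: iterable of (user, item) pairs; duplicates are counted once.
    Assumption: possible edges = (#distinct users) * (#distinct items).
    """
    unique = set(edges)
    users = {u for u, _ in unique}
    items = {i for _, i in unique}
    possible = len(users) * len(items)
    return len(unique) / possible if possible else 0.0


# Hypothetical toy data: 3 users, 3 items, 4 observed edges.
toy_edges = [("u1", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")]
toy_s = sparseness(toy_edges)
```

Under this definition a smaller S means a sparser dataset, which is why the Mendeley figure is the more worrying of the two values quoted on the slide.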
  • 5. (2009) Ex Libris bX solution to data sparsity: harvest lots of usage (co-download) behaviour from world-wide SFX (Ex Libris OpenURL resolver) logs and apply collaborative filtering to correlate articles.
      Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. (in JCDL 2006)
  • 6. TechLens+ Citation-Based Recommendation
      [Figure: articles and their references; labels p2, p3, p5]
      R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital Libraries with TechLens+. (in JCDL 2004)
  • 7. Does “Rated” Citations w/ PageRank Help?
      [Figure: rating matrix. Columns: citations p1 to p8. Rows: articles p1 to p4 and users u1, u2. Cells hold ratings between 0.2 and 0.7, or a constant.]
      - Answer: Using PageRank to “rate” citations is not significantly better than using a constant (0/1)
      - Note: There is ongoing work w/ NRC on a machine learning method for extracting “most important references”, which might help more
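The slide does not say how the PageRank ratings were derived, so the following is only one plausible reading: compute PageRank over the citation graph and use the score of the cited article as the "rating" of each citation, in place of the constant 1. The function names, the damping factor of 0.85 and the toy graph are all assumptions.

```python
def pagerank(cites, damping=0.85, iters=100):
    """Plain power-iteration PageRank.

    cites: dict mapping each article to the list of articles it cites.
    Returns a dict article -> score; scores sum to 1.
    """
    nodes = sorted(set(cites) | {t for targets in cites.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Articles without references spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if not cites.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in cites.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


def rated_citations(cites):
    """Replace each 0/1 citation with the PageRank of the cited article."""
    rank = pagerank(cites)
    return {src: {t: rank[t] for t in targets} for src, targets in cites.items()}


# Hypothetical toy graph: p1 cites p2 and p3, p2 cites p3.
toy = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": []}
toy_rank = pagerank(toy)
toy_ratings = rated_citations(toy)
```

The resulting nested dict can stand in for the binary article-citation matrix when it is fed to a collaborative filtering step.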
  • 8. Sarkanto (NRC Article Recommender)
      - Uses the TechLens+ strategy of replacing the User-Item matrix with an Article-Article matrix from citation data
      - Uses the TASTE recommender (now the recommendation component of Mahout)
      - Is now decoupled from the user-based recommender
      - Compare side by side w/ ‘bX’ recommendations
      Try it here: http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
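The idea of treating articles as "users" and their references as "items" can be sketched as follows. This is not the TASTE/Mahout implementation and not Sarkanto's code; it is a minimal neighbourhood-based sketch that assumes cosine similarity over binary citation sets, with hypothetical function names and toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets of cited articles (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(cites, seed, k=2, n=3):
    """Recommend up to n articles for `seed`.

    cites: dict article -> set of articles it cites.
    The k articles whose reference lists are most similar to the seed's
    vote for the references the seed does not already cite.
    """
    seed_refs = cites[seed]
    sims = {other: cosine(seed_refs, refs)
            for other, refs in cites.items() if other != seed}
    neighbours = sorted(sims, key=lambda a: (-sims[a], a))[:k]
    scores = {}
    for nb in neighbours:
        for ref in cites[nb] - seed_refs - {seed}:
            scores[ref] = scores.get(ref, 0.0) + sims[nb]
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return [a for a in ranked if scores[a] > 0][:n]


# Hypothetical toy citation data.
toy = {"p1": {"p4", "p5"}, "p2": {"p4", "p5", "p6"}, "p3": {"p6", "p7"}}
toy_recs = recommend(toy, "p1")
```

Here p2 shares both of p1's references, so its remaining reference p6 is recommended; p3 shares none and contributes nothing.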
  • 9. Sarkanto compared w/ bX
      - Sarkanto: “These are articles whose co-citations are similar to this one.”
      - bX: “Users who viewed this article also viewed these articles.”
  • 10. Experiments
      - Sarkanto generated ~1.9 million citation-based recommendations (statically)
      - Experimental comparison done on 1886 randomly selected articles from a subset of ~1.2M articles (down from ~8M)
      - Questions asked in the experiment:
          - How many recommendations are produced by each recommender?
          - Coverage (how often does a seed article generate a recommendation?)
          - How semantically diverse are the recommendations?
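The coverage question can be made concrete with a small sketch. It assumes each recommender's output is available as a mapping from seed article to a list of recommended articles; the function names and the toy data are hypothetical.

```python
def coverage(recs_by_seed):
    """Fraction of seed articles that received at least one recommendation."""
    if not recs_by_seed:
        return 0.0
    covered = sum(1 for recs in recs_by_seed.values() if recs)
    return covered / len(recs_by_seed)


def covered_by_both(recs_a, recs_b):
    """Fraction of seeds for which both recommenders returned something."""
    seeds = set(recs_a) | set(recs_b)
    if not seeds:
        return 0.0
    both = sum(1 for s in seeds if recs_a.get(s) and recs_b.get(s))
    return both / len(seeds)


# Hypothetical toy output for four seed articles.
sarkanto = {"a1": ["a7", "a9"], "a2": [], "a3": ["a4"], "a4": []}
bx = {"a1": ["a5"], "a2": ["a6"], "a3": [], "a4": []}
```

With this toy data each recommender covers half of the seeds, but only one seed in four gets recommendations from both.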
  • 11. Measuring Semantic Diversity
      - Question: what is the semantic distance between the source article and the recommendations?
      - In this setup it was not possible to compare the semantic distance without the full text for both sets of recommendations
      - Full text is available for the Sarkanto recommendations but not for the bX recommendations
  • 12. Journal-Journal Semantic Distance
      - Concatenate the full text of all the articles in each journal
      - From a Lucene index of the full text in each journal, use Dominic Widdows’ Semantic Vectors package to create
          - a term-journal matrix,
          - reduced-dimensionality term vectors (512) for each journal using random projections
      - Apply multidimensional scaling (MDS) in R to obtain a 2-D distance matrix (2300 x 2300)
      G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for search visualization in a large scale article digital library. In Second Workshop on Very Large Digital Libraries, ECDL 2009
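The two numerical steps on slide 12, random projection and MDS, can be sketched with NumPy. This is an illustration under stated assumptions, not the Semantic Vectors or R code used in the work: a dense Gaussian projection matrix stands in for the package's random vectors, classical (Torgerson) MDS stands in for whichever MDS routine was run in R, and the toy matrix is random data.

```python
import numpy as np


def random_projection(X, dim, seed=0):
    """Project the rows of X down to `dim` dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], dim)) / np.sqrt(dim)
    return X @ R


def pairwise_distances(V):
    """Euclidean distance between every pair of rows of V."""
    diff = V[:, None, :] - V[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, dim=2):
    """Classical MDS: coordinates whose distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]         # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))


# Hypothetical toy input: 4 "journals" described by 50 term weights each.
journals = np.random.default_rng(1).random((4, 50))
reduced = random_projection(journals, dim=8)
coords = classical_mds(pairwise_distances(reduced))
```

Each row of `coords` is one journal's position on a 2-D map, so journals with similar vocabulary end up close together.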
  • 13. 2-D Journal Distance Map
      Coloured clusters represent journal subject headings (from publisher metadata)
      http://cuvier.cisti.nrc.ca/~gnewton/torngat/applet.2009.07.22/index.html
  • 14. Results: Diversity of Recommendations
      - ~13% of seed articles generated recommendations for both bX and Sarkanto (i.e. not much overlap!)
      - Citation-based recommendations appear to be more semantically diverse than user-based ones
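One way to put a number on "semantically diverse" is the average distance, on the 2-D journal map, between the seed article's journal and the journals of its recommendations. The slides do not state the exact measure used, so this is an assumed formulation; the function name, journal ids and coordinates are hypothetical.

```python
import math


def mean_journal_distance(seed_journal, rec_journals, coords):
    """Average 2-D distance from the seed's journal to each recommended journal.

    coords: dict journal id -> (x, y) position on the journal map.
    A larger value means the recommendations range further from the seed.
    """
    sx, sy = coords[seed_journal]
    dists = [math.hypot(coords[j][0] - sx, coords[j][1] - sy)
             for j in rec_journals]
    return sum(dists) / len(dists) if dists else 0.0


# Hypothetical journal coordinates and recommendations.
toy_coords = {"J1": (0.0, 0.0), "J2": (3.0, 4.0), "J3": (0.0, 1.0)}
toy_diversity = mean_journal_distance("J1", ["J2", "J3"], toy_coords)
```

Averaging this value over all seed articles, once per recommender, would give two figures that can be compared directly.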
  • 15. Conclusions
      - Citation-based and user-based recommendations are complementary
      - Different kinds of data sources (users vs. citations) produce different kinds of (non-overlapping) results
      - Citation-based recommendations are more semantically diverse
      - Hypothesis: “user-based recommendations may be biased by the semantic similarity of search-engine results”