The MovieLens Datasets:
History and Context
Max Harper (presenter)
Joe Konstan
http://tiis.acm.org/iui16/
MovieLens: 5 star movie ratings
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940
...
138493,69644,3.0,1260209457
138493,70286,5.0,1258126944
138493,71619,2.5,1255811136
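A minimal sketch of reading rows in this format, assuming only the four-column header shown above. The `Rating` type and `read_ratings` helper are hypothetical names introduced for illustration, and the inline `SAMPLE` string stands in for a real ratings file.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float      # star value, e.g. 3.5
    timestamp: int     # seconds since the Unix epoch


# A few rows copied from the sample above; a real run would open the
# ratings file from a downloaded dataset instead of this string.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,151,4.0,1094785734
138493,70286,5.0,1258126944
"""


def read_ratings(lines):
    """Yield one Rating per CSV row, converting each field to its type."""
    for row in csv.DictReader(lines):
        yield Rating(
            user_id=int(row["userId"]),
            movie_id=int(row["movieId"]),
            rating=float(row["rating"]),
            timestamp=int(row["timestamp"]),
        )


ratings = list(read_ratings(io.StringIO(SAMPLE)))
```

Passing an open file object instead of `io.StringIO(SAMPLE)` should work the same way, since `csv.DictReader` accepts any iterable of lines.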
web site: dataset:
ratings data is interesting, intuitive,
and pervasive
dataset impact
» 140,000 downloads in 2014
» a search for “movielens” yields
• 6,020 results in Google Books
• 8,920 results in Google Scholar
dataset uses
» research
» technical: programming books + blogs
» educational (including a MOOC)
» industrial R&D, demos
overview
» MovieLens datasets overview
» dataset stability, system change
<user, movie, rating, timestamp>
<Max, Toy Story, 4.0, 2010-12-01 12:00:00>
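The readable form of the tuple can be approximated in code. This sketch converts only the timestamp, assuming it counts seconds since the Unix epoch in UTC; mapping ids to names such as "Max" or "Toy Story" would need lookup data that is not shown here. `to_readable` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_readable(user_id, movie_id, rating, timestamp):
    """Return the tuple with the timestamp rendered as a UTC date string."""
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return (user_id, movie_id, rating, when.strftime("%Y-%m-%d %H:%M:%S"))


# First sample row from the ratings file shown earlier in the deck.
row = to_readable(1, 2, 3.5, 1112486027)
print(row)  # (1, 2, 3.5, '2005-04-02 23:53:47')
```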
MovieLens benchmark datasets
Name     Dates      Users    Movies  Ratings     Density
ML 100K  ‘97 – ‘98  943      1,682   100,000     6.30%
ML 1M    ‘00 – ‘03  6,040    3,706   1,000,209   4.47%
ML 10M   ‘95 – ‘09  69,878   10,681  10,000,054  1.34%
ML 20M   ‘95 – ‘15  138,493  27,278  20,000,263  0.54%
designed for replicability
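The slide does not define the Density column. A plausible reading, assumed here, is the share of all user-movie pairs that have a rating. The sketch below checks that reading against the ML 100K and ML 1M rows only; `density` is a hypothetical helper name.

```python
def density(users: int, movies: int, ratings: int) -> float:
    """Fraction of user-movie pairs that have a rating."""
    return ratings / (users * movies)


# Counts copied from the benchmark table.
ml_100k = density(943, 1_682, 100_000)
ml_1m = density(6_040, 3_706, 1_000_209)

print(f"ML 100K: {ml_100k:.2%}")  # ML 100K: 6.30%
print(f"ML 1M:   {ml_1m:.2%}")    # ML 1M:   4.47%
```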
MovieLens latest datasets
Name             Dates      Users    Movies  Ratings     Density
ML Latest        ‘95 – ‘16  247,753  34,208  22,884,377  0.3%
ML Latest Small  ‘96 – ‘16  668      10,329  105,339     1.5%
designed for recency
overview
» MovieLens datasets overview
» dataset stability, system change
tension: datasets vs. system
» ideal (pure) vs. actual (it’s complex)
» systems want to change
• stay current, constant improvements
• A/B tests, beta testing, and other experiments
» context changes
• devices, competing sites, changing user base
some key changes
» core flow of browse/search
» rating widget
» recommender
» new user experience
» …
history of experiments
» both online field experiments and online
lab experiments
» created temporary and permanent
changes, changed pattern of use
in the paper
» the story of MovieLens (1997 origins)
• lessons learned from running a “real” system
in a research lab
• lots of fun descriptive stats/charts
» best practices for dataset researchers
• limitations
• alternatives
people who made this possible
» John Riedl
» Istvan Albert, Al Borchers, Dan Cosley,
Brent J. Dahlen, Rich Davies, Michael
Ekstrand, Dan Frankowski, Nathaniel
Good, Jon Herlocker, Daniel Kluver,
Shyong (Tony) Lam, Michael Ludwig,
Sean McNee, Chad Salvatore, Shilad Sen,
and Loren Terveen
» MovieLens users
in ACM Transactions on Interactive Intelligent Systems, Dec. 2015
» feedback? contact us: grouplens-info@cs.umn.edu
presented by Max Harper, Research Scientist, University of Minnesota,
harper@cs.umn.edu
written with Joe Konstan, Distinguished McKnight University Professor,
University of Minnesota, konstan@cs.umn.edu
This material is based on work supported by the National Science Foundation under grants
DGE-9554517, IIS-9613960, IIS-9734442, IIS-9978717, EIA-9986042, IIS-0102229,
IIS-0324851, IIS-0534420, IIS-0808692, IIS-0964695, IIS-0968483, IIS-1017697, IIS-1210863.
This project was also supported by the University of Minnesota’s Undergraduate Research
Opportunities Program and by grants and/or gifts from Net Perceptions, Inc., CFK Productions,
and Google.
version 0 (1997) version 4 (2014)
one solution
» document change, include with datasets
key dataset limitations (1/2)
» system UI and recommender changes
» bias towards “successful” users
» possible bias towards users with tolerance
for “research quality” design
» timestamps do not reflect time of
consumption
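The timestamp limitation can be illustrated with the nine sample rows for userId 1 shown earlier in the deck. The sketch groups that user's ratings into sittings, assuming (arbitrarily) that a gap of more than 30 minutes starts a new sitting. `split_sessions` is a hypothetical helper, not part of any MovieLens tooling.

```python
def split_sessions(timestamps, max_gap_seconds=1800):
    """Group timestamps into runs separated by more than max_gap_seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)   # close enough: same sitting
        else:
            sessions.append([ts])     # large gap: start a new sitting
    return sessions


# Timestamps of the nine sample ratings shown for userId 1.
user_1 = [
    1112486027, 1112484676, 1112484819, 1112484727, 1112484580,
    1094785740, 1094785734, 1112485573, 1112484940,
]

sessions = split_sessions(user_1)
sizes = [len(s) for s in sessions]          # ratings per sitting
spans = [s[-1] - s[0] for s in sessions]    # seconds covered by each sitting

print(sizes)  # [2, 7]
print(spans)  # [6, 1447]
```

Seven ratings entered within about 24 minutes cannot each mark when a movie was watched, which is consistent with the limitation stated above.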
key dataset limitations (2/2)
» recommender systems research
community attitudes
• implicit behaviors > ratings?
• dataset-only research increasingly
discouraged
MovieLens system evolution
key changes and experiments
alternative datasets
Name               Domain           Rating Scale  Ratings  Density
Book-Crossing      books            0 - 10        1.1m     0.003%
EachMovie          movies           0 - 14        2.7m     2.872%
Jester (dataset1)  jokes            -10 - 10      4.1m     57.463%
Amazon             many             1 - 5         82.8m    < 0.001%
Netflix Prize      movies           1 - 5         100.5m   1.178%
Yahoo Music (C15)  music (various)  0 - 100       262.8m   0.042%
EachMovie
lessons from running MovieLens
» lessons from startups apply (it’s hard, fail
fast)
» continual work, not one-time effort
» encourage code quality through good
social coding conventions
» invest in tools that allow users to help
dataset uses
» recommender systems research
» recommender systems MOOC
• http://coursera.org/learn/recommender-systems
» code examples (popular press, blogs)
» higher education
» commercial – internal testing

The MovieLens Datasets: History and Context

  • 1. The MovieLens Datasets: History and Context Max Harper (presenter) Joe Konstan
  • 3. MovieLens: 5 star movie ratings userId,movieId,rating,timestamp 1,2,3.5,1112486027 1,29,3.5,1112484676 1,32,3.5,1112484819 1,47,3.5,1112484727 1,50,3.5,1112484580 1,112,3.5,1094785740 1,151,4.0,1094785734 1,223,4.0,1112485573 1,253,4.0,1112484940 ... 138493,69644,3.0,1260209457 138493,70286,5.0,1258126944 138493,71619,2.5,1255811136 3 web site: dataset:
  • 4. ratings data is interesting, intuitive, and pervasive
  • 5. dataset impact » 140,000 downloads in 2014 » a search for “movielens” yields • 6,020 results in Google Books • 8,920 results in Google Scholar
  • 6. dataset uses » research » technical: programming books + blogs » educational (including a MOOC) » industrial R&D, demos
  • 7. overview » MovieLens datasets overview » dataset stability, system change
  • 9. <user, movie, rating, timestamp> <Max, Toy Story, 4.0, 2010-12-01 12:00:00>
  • 10. MovieLens benchmark datasets
      Name      Dates       Users     Movies   Ratings      Density
      ML 100K   ‘97 – ‘98   943       1,682    100,000      6.30%
      ML 1M     ‘00 – ‘03   6,040     3,706    1,000,209    4.47%
      ML 10M    ‘95 – ‘09   69,878    10,681   10,000,054   1.34%
      ML 20M    ‘95 – ‘15   138,493   27,278   20,000,263   0.54%
      designed for replicability
  • 11. MovieLens latest datasets
      Name              Dates       Users     Movies   Ratings      Density
      ML Latest         ‘95 – ‘16   247,753   34,208   22,884,377   0.003%
      ML Latest Small   ‘96 – ‘16   668       10,329   105,339      0.015%
      designed for recency
  • 12. overview » MovieLens datasets overview » dataset stability, system change
  • 13. tension: datasets vs. system » ideal (pure) vs. actual (it’s complex) » systems want to change • stay current, constant improvements • A/B tests, beta testing, and other experiments » context changes • devices, competing sites, changing user base
  • 19. some key changes » core flow of browse/search » rating widget » recommender » new user experience » …
  • 20. history of experiments » both online field experiments and online lab experiments » created temporary and permanent changes, changed pattern of use
  • 22. in the paper » the story of MovieLens (1997 origins) • lessons learned from running a “real” system in a research lab • lots of fun descriptive stats/charts » best practices for dataset researchers • limitations • alternatives
  • 23. people who made this possible » John Riedl » Istvan Albert, Al Borchers, Dan Cosley, Brent J. Dahlen, Rich Davies, Michael Ekstrand, Dan Frankowski, Nathaniel Good, Jon Herlocker, Daniel Kluver, Shyong (Tony) Lam, Michael Ludwig, Sean McNee, Chad Salvatore, Shilad Sen, and Loren Terveen » MovieLens users
  • 24. in ACM Transactions on Interactive Intelligent Systems, Dec. 2015 » feedback? contact us: grouplens-info@cs.umn.edu presented by Max Harper, Research Scientist, University of Minnesota, harper@cs.umn.edu written with Joe Konstan, Distinguished McKnight University Professor, University of Minnesota, konstan@cs.umn.edu This material is based on work supported by the National Science Foundation under grants DGE-9554517, IIS-9613960, IIS-9734442, IIS-9978717, EIA-9986042, IIS-0102229, IIS-0324851, IIS-0534420, IIS-0808692, IIS-0964695, IIS-0968483, IIS-1017697, IIS-1210863. This project was also supported by the University of Minnesota’s Undergraduate Research Opportunities Program and by grants and/or gifts from Net Perceptions, Inc., CFK Productions, and Google.
  • 26. version 0 (1997) version 4 (2014)
  • 27. one solution » document change, include with datasets
  • 28. key dataset limitations (1/2) » system UI and recommender changes » bias towards “successful” users » possible bias towards users with tolerance for “research quality” design » timestamps do not reflect time of consumption
  • 29. key dataset limitations (2/2) » recommender systems research community attitudes • implicit behaviors > ratings? • dataset-only research increasingly discouraged
  • 31. MovieLens system evolution: key changes and experiments
  • 32. alternative datasets
      Name                Domain            Rating Scale   Ratings   Density
      Book-Crossing       books             0 - 10         1.1m      0.003%
      EachMovie           movies            0 - 14         2.7m      2.872%
      Jester (dataset1)   jokes             -10 - 10       4.1m      57.463%
      Amazon              many              1 - 5          82.8m     < 0.001%
      Netflix Prize       movies            1 - 5          100.5m    1.178%
      Yahoo Music (C15)   music (various)   0 - 100        262.8m    0.042%
  • 34. lessons from running MovieLens » lessons from startups apply (it’s hard, fail fast) » continual work, not one-time effort » encourage code quality through good social coding conventions » invest in tools that allow users to help
  • 35. dataset uses » recommender systems research » recommender systems MOOC • http://coursera.org/learn/recommender-systems » code examples (popular press, blogs) » higher education » commercial – internal testing
  • 36. 36
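
The Density column in the benchmark datasets table (slide 10) is consistent with the share of possible user-movie pairs that actually carry a rating. The slides do not state the formula, so the sketch below assumes density = ratings / (users × movies); the function name `density_percent` is made up for illustration, and the counts are copied from three rows of that table.

```python
def density_percent(users: int, movies: int, ratings: int) -> float:
    """Share of the users x movies matrix that holds a rating, in percent.

    Assumption: density = ratings / (users * movies). The slides list the
    Density column but do not spell out how it was computed.
    """
    return 100.0 * ratings / (users * movies)


# (users, movies, ratings) as listed on the benchmark datasets slide.
BENCHMARKS = {
    "ML 100K": (943, 1_682, 100_000),
    "ML 1M": (6_040, 3_706, 1_000_209),
    "ML 10M": (69_878, 10_681, 10_000_054),
}

if __name__ == "__main__":
    for name, (users, movies, ratings) in BENCHMARKS.items():
        print(f"{name}: {density_percent(users, movies, ratings):.2f}%")
```

Under that assumption, the falling density from ML 100K to ML 10M reflects the user and movie counts growing faster than the number of ratings.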

Editor's Notes

  1. I am the current caretaker of a system called MovieLens, and of the datasets that are derived from that system. I'm here to present a paper that we published in Transactions on Interactive Intelligent Systems about MovieLens and the MovieLens datasets. Notes: What is the point? Why should I listen to this talk? Why are you telling us this? Theme: the tension between building/maintaining a real system and producing a “pure” dataset. A solution (impossible to implement retroactively) is to document extensively (e.g., add a version number to each rating). There are many other things that changed beyond the ones listed in the current talk… mention them briefly? Add a road-map at the beginning of the talk, maybe “things to know if you use the movielens datasets”. Include most cited papers (+1). Don’t say specifics about recommenders; just say how the high-level effect might have influenced ratings. Why are you telling us MovieLens history? We’re sharing these lessons because we think they’re useful for users and for people who want to generate their own datasets. Say as a theme: the system changes, and that has an impact on the dataset? Mention genome and other GroupLens datasets?
  2. MovieLens is a web site that collects 5-star ratings on movies. We have collected the result of many users providing many of these movie ratings in the MovieLens datasets, a publicly available resource for folks to explore rating data. Notes: possibly convert to a data table
  3. Fundamentally, movielens is relevant because ratings-based systems have become so prevalent across a variety of systems (maybe cut this slide?)
  4. most of these books and papers refer to the datasets, rather than the system Notes: just say “mooc”
  5. most of these books and papers refer to the datasets, rather than the system Notes: just say “mooc”
  6. Two goals in this talk: (1) introduce the MovieLens datasets, to make sure everyone knows what I’m talking about and to catch some of you up on new releases; (2) discuss the tension between system-building and dataset purity, which I hope will be useful both to inform us about some potential limitations inherent in dataset-based research and to inform researchers engaged in releasing datasets of their own. Relevance to IUI folks who: conduct dataset research; peer review dataset research; build systems; release datasets.
  7. Fundamentally, the MovieLens datasets describe users’ movie rating behavior. The core of the dataset contains tuples of the form shown here.
  8. For example: user Max rated the movie Toy Story 4 stars at a particular time. Rating values represent “half-star” ratings, from 0.5 stars to 5 stars. Timestamps represent the most recent time when the rating was provided. In our latest dataset, there are about 20 million records like these.
  9. Here are the four dataset versions we’ve released, one about every five years. They vary quite a bit in their characteristics. The older datasets are most useful for comparing new work to existing published studies. We recommend that new work that is not comparative use the 20M dataset.
  10. For development or educational work, we have released a set of non-stable “latest” datasets, kept up to date (generated in 2016 to include new movies). Latest is unabridged, containing all users, including those with just 1 rating. Latest-small is kept to 100k ratings for speed of development and testing; it is designed for educational purposes, demos, and other needs that don’t require big data. Latest-small is also redistributable for non-commercial purposes.
  11. Ideal: “pure” datasets. Actual: user-generated datasets come from user interaction with a system, and systems change. These changes work against the concept of generating pure data. MovieLens is a good case study, since it has been around for so long.
  12. Here it is! This is MovieLens, circa August 1997, around the time of its launch, as rendered by Netscape Navigator 4. MovieLens has operated continuously since that time. Let’s look through some screenshots showing its evolution.
  13. version 1, released September 1999
  14. version 2, released February 2000
  15. version 3, released February 2003
  16. and most recently, version 4, released November 2014. This is basically what it will look like if you visit today.
  17. core flow of browse/search; rating widget (half stars, number of clicks); recommender (prediction, ordering); new user experience (“entry barrier”, initial personalization). There’s more: tagging, movie management, social features, … Details: recommender (1997 user-user via grouplens, 1999 user-user net perceptions, 2003 item-item multilens, 2012 item-item lenskit, 2014 popularity blending item-item or svd); new user (1997 rate 5 from 10 at a time (9 random, 1 easy), 2002 rate 15 selected for popularity, 2014 pick groups recommender); ratings widget (1997 5 stars dropdown, 2003 half stars pulldown, 2014 clickable stars). Notes: more visuals; too much here
  18. …not unique to MovieLens, practice of A/B testing affects most datasets (e.g., Netflix, Amazon)
  19. And yet we find remarkable stability in general use of the ratings widget in aggregate. The chart shows average and median ratings across time, aggregated by month. Given the extent of the changes we’ve just discussed, it is somewhat remarkable to observe so little monthly variation. Notes: get rid of median line?
  20. a brief acknowledgement of the people who made this retrospective look possible
  21. The core idea or premise hasn’t really changed since its initial release! MovieLens is a system that helps people find movies to watch. It works by asking users to rate movies to express their preferences in 1-5 stars. It uses those ratings to predict subsequent ratings. It can prioritize the display of highly-predicted ratings to personalize the experience.
  22. Notes: polish presentation of timestamps + influence
  23. Usage: MovieLens has been used by lots of people, all around the world. We’ve registered about 280,000 people since launching in 1997, and the system has welcomed several thousand monthly active users since 2001. Notes: maybe combine with other chart?
  24. To understand the datasets, it is critical to understand the underlying system. Like all systems, MovieLens has changed. Like many systems, MovieLens has experimented with features.
  25. There are a variety of other datasets that provide different characteristics. This table shows some of the most prominent ones. The two biggest alternatives in the movies space, EachMovie and Netflix, have each been redacted and are no longer available, legally speaking. However, there are a number of great alternatives for ratings data across other domains. Notes: maybe cut this slide; explain the cross-outs
  26. Let’s go back to the mid-90’s. Digital Equipment Corporation (DEC) was running an experimental system called EachMovie. EachMovie was built to explore the still young idea of personalized recommendations with collaborative filtering. But in 1997, DEC decided to shut down EachMovie. The DEC researchers reached out to the recommender systems community, looking for an organization to develop a replacement site, to serve the same users. Joe Konstan and John Riedl (pictured here) responded, and had their graduate students build a “copy” of EachMovie, backed by the GroupLens recommender engine.
  27. our paper has links to all of those, if you’re interested!
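
The monthly aggregation mentioned in note 19 (average and median rating per month) can be sketched from rows in the `userId,movieId,rating,timestamp` format shown on slide 3. This is only an illustration, not the authors' analysis code: it assumes the timestamp column holds seconds since the Unix epoch in UTC, the sample rows are a handful copied from the slide, and `monthly_rating_stats` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, median

# A few rows copied from slide 3; a real ratings file would be far larger.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,112,3.5,1094785740
1,151,4.0,1094785734
138493,69644,3.0,1260209457
138493,70286,5.0,1258126944
"""


def monthly_rating_stats(csv_text: str) -> dict:
    """Map (year, month) to (mean rating, median rating).

    Assumption: timestamps are seconds since the Unix epoch, read as UTC.
    """
    by_month = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        when = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
        by_month[(when.year, when.month)].append(float(row["rating"]))
    # Sort keys so the result reads chronologically.
    return {m: (mean(v), median(v)) for m, v in sorted(by_month.items())}


if __name__ == "__main__":
    for (year, month), (avg, med) in monthly_rating_stats(SAMPLE).items():
        print(f"{year}-{month:02d}: mean={avg:.2f} median={med:.2f}")
```

Because each timestamp records the most recent time a rating was provided (note 8), a monthly series built this way reflects when ratings were entered or last changed, not when the movies were watched.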