MovieTweetings: a Movie Rating
Dataset Collected From Twitter
@sidooms
Simon Dooms
Research datasets
• RecSys research needs datasets
• To evaluate, experiment and demonstrate
• I need datasets
Available for download:
• MovieLens 100K
• MovieLens 1M
• MovieLens 10M
Oct. 12, 2013 - Simon Dooms - Ghent University - CrowdRec 2013
Research datasets
• RecSys research needs datasets
• To evaluate, experiment and demonstrate
• I needed datasets
Available for download:
• MovieLens 100K (most recent movie: 1998)
• MovieLens 1M (most recent movie: 2000)
• MovieLens 10M (most recent movie: 2008)
I need up-to-date movie ratings
Finding data
• Data is all around us
  BUT extremely unstructured
• What we want:
1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
1::316::5::838983392
(user, item, rating, time)
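A record in this shape can be read with a few lines of code. The sketch below is only an illustration: the `Rating` type and the `parse_rating` helper are names assumed here, and treating the last field as a Unix timestamp in seconds is an assumption based on the sample values.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user: int
    item: str       # kept as text so identifiers with leading zeros survive
    rating: int
    time: datetime


def parse_rating(line: str) -> Rating:
    """Split one 'user::item::rating::time' record into typed fields."""
    user, item, rating, timestamp = line.strip().split("::")
    return Rating(
        user=int(user),
        item=item,
        rating=int(rating),
        # Assumption: the time field is a Unix timestamp in seconds.
        time=datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


record = parse_rating("1::122::5::838985046")
print(record.user, record.item, record.rating, record.time.isoformat())
```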
Structured data
“I rated Death Proof 10/10 #IMDb”
• User
• Item (movie)
• Rating
• Hashtag
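Because the tweet text follows a fixed template, the item and rating can be pulled out with a regular expression (the user comes from the tweet's metadata rather than its text). This is a rough sketch, not the author's extraction code; the pattern and the `extract_rating` helper are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for tweets shaped like "I rated <title> <n>/10 ... #IMDb".
RATING_TWEET = re.compile(
    r"^I rated (?P<title>.+?) (?P<rating>\d{1,2})/10\b.*#IMDb",
    re.IGNORECASE,
)


def extract_rating(tweet_text: str) -> Optional[Tuple[str, int]]:
    """Return (title, rating) for a well-formed rating tweet, else None."""
    match = RATING_TWEET.search(tweet_text.strip())
    if match is None:
        return None
    rating = int(match.group("rating"))
    if not 1 <= rating <= 10:
        return None  # outside the 1 to 10 scale
    return match.group("title"), rating


print(extract_rating("I rated Death Proof 10/10 #IMDb"))
print(extract_rating("Just watched a movie"))
```

The lazy `.+?` lets titles that end in a number (for example "Iron Man 3") keep that number instead of losing it to the rating.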
Structured data
Search Twitter for
“I rated #IMDb”
Bingo!
Collecting data
• We query the Twitter API for “I rated #IMDb”
• Extract relevant information
• Cross-reference with IMDb for extra genre data
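The three steps above could be chained roughly as follows. This is a sketch under stated assumptions, not the actual collection code: the Twitter query is replaced by a hard-coded list of tweet dictionaries, each tweet is assumed to carry a link to the IMDb title page, and `IMDB_LOOKUP` is a hypothetical in-memory stand-in for the IMDb cross-reference.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed tweet shape: "I rated <title> <n>/10 <IMDb title link> #IMDb".
RATING_TWEET = re.compile(
    r"I rated (?P<title>.+?) (?P<rating>\d{1,2})/10 "
    r".*?imdb\.com/title/tt(?P<imdb_id>\d+)"
)

# Hypothetical stand-in for the IMDb lookup: IMDb ID -> (title, genres).
IMDB_LOOKUP: Dict[str, Tuple[str, str]] = {
    "1028528": ("Death Proof (2007)", "Action|Thriller"),
}


def collect(tweets: Iterable[dict]) -> Tuple[List[str], List[str]]:
    """Turn raw tweets into '::'-separated rating rows and movie rows."""
    user_numbers: Dict[int, int] = {}   # Twitter ID -> sequential user number
    movies: Dict[str, str] = {}         # IMDb ID -> movie row (deduplicated)
    rating_rows: List[str] = []
    for tweet in tweets:
        match = RATING_TWEET.search(tweet["text"])
        if match is None:
            continue  # not a well-formed rating tweet
        user = user_numbers.setdefault(tweet["twitter_id"], len(user_numbers) + 1)
        imdb_id = match.group("imdb_id")
        rating_rows.append(
            f"{user}::{imdb_id}::{match.group('rating')}::{tweet['timestamp']}"
        )
        if imdb_id in IMDB_LOOKUP:
            title, genres = IMDB_LOOKUP[imdb_id]
            movies[imdb_id] = f"{imdb_id}::{title}::{genres}"
    return rating_rows, list(movies.values())


# Made-up input standing in for the results of the Twitter query.
sample_tweets = [
    {"twitter_id": 18405182, "timestamp": 1365029107,
     "text": "I rated Death Proof 10/10 http://www.imdb.com/title/tt1028528 #IMDb"},
    {"twitter_id": 31260677, "timestamp": 1365029200,
     "text": "Just watched a movie"},
]
rating_rows, movie_rows = collect(sample_tweets)
print(rating_rows)
print(movie_rows)
```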
The data
Ratings.dat
1::1074638::7::1365029107
1::1853728::8::1366576639
2::0113277::10::1379466669
• The item field is the IMDb ID, e.g. 0113277 for http://www.imdb.com/title/tt0113277
• Rating scale from 1 to 10
Movies.dat
1028528::Death Proof (2007)::Action|Thriller
0133093::The Matrix (1999)::Action|Adventure|Sci-Fi
1670345::Now You See Me (2013)::Thriller
Users.dat
1::18405182
2::995885060
3::31260677
• The second field is the Twitter ID (NOT the @handle)
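To make the file layout concrete, the sample rows from this slide can be parsed as below. It is a minimal sketch: the rows are embedded as strings rather than read from the real files, and the helper and variable names are assumptions. IMDb IDs are kept as strings because converting them to integers would drop leading zeros.

```python
from typing import Dict, List

RATINGS = """1::1074638::7::1365029107
1::1853728::8::1366576639
2::0113277::10::1379466669"""

MOVIES = """1028528::Death Proof (2007)::Action|Thriller
0133093::The Matrix (1999)::Action|Adventure|Sci-Fi
1670345::Now You See Me (2013)::Thriller"""

USERS = """1::18405182
2::995885060
3::31260677"""


def read_rows(text: str) -> List[List[str]]:
    """Split every non-empty line on the '::' field separator."""
    return [line.split("::") for line in text.splitlines() if line]


# user number -> Twitter ID (a numeric account ID, not the @handle)
twitter_ids: Dict[int, int] = {int(u): int(t) for u, t in read_rows(USERS)}

# IMDb ID -> list of genres
genres: Dict[str, List[str]] = {
    imdb_id: genre_field.split("|") for imdb_id, _title, genre_field in read_rows(MOVIES)
}

described = []
for user, imdb_id, rating, timestamp in read_rows(RATINGS):
    described.append({
        "twitter_id": twitter_ids[int(user)],
        "url": f"http://www.imdb.com/title/tt{imdb_id}",
        "rating": int(rating),
        "timestamp": int(timestamp),
    })

for row in described:
    print(row)
```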
Your data
• MovieTweetings dataset available on GitHub
  (https://github.com/sidooms/MovieTweetings)
• Find it on the RecSys Wiki (category datasets)
Latest
• All ratings
• Automagically updated daily
Snapshots
• Fixed portion of dataset
• Added manually when appropriate
• 10K, 20K, 30K, 40K, 50K, 100K
DISCLAIMER: Depending on Twitter API, IMDb apps and me!
Some numbers (results on September 30, 2013)
          MovieTweetings   MovieLens 100K   MovieLens 1M   MovieLens 10M
Ratings   121,404          100,000          1,000,209      10,000,054
Users     19,464           943              6,040          71,567
Items     11,655           1,682            3,900          10,681
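One way to compare these counts is to derive ratings per user and the fraction of the user-item matrix that is filled. These two measures are not on the slide; the sketch below simply computes them from the table's figures, and the `summarize` helper is a name assumed here.

```python
from typing import Dict, Tuple

# (ratings, users, items) as listed in the table above.
DATASETS: Dict[str, Tuple[int, int, int]] = {
    "MovieTweetings": (121_404, 19_464, 11_655),
    "MovieLens 100K": (100_000, 943, 1_682),
    "MovieLens 1M": (1_000_209, 6_040, 3_900),
    "MovieLens 10M": (10_000_054, 71_567, 10_681),
}


def summarize(ratings: int, users: int, items: int) -> Dict[str, float]:
    """Derive average ratings per user and user-item matrix density."""
    return {
        "ratings_per_user": ratings / users,
        "density": ratings / (users * items),
    }


summary = {name: summarize(*counts) for name, counts in DATASETS.items()}
for name, stats in summary.items():
    print(
        f"{name:15s} {stats['ratings_per_user']:7.1f} ratings/user, "
        f"{stats['density']:.4%} of user-item pairs rated"
    )
```

On these figures MovieTweetings has far fewer ratings per user than any of the MovieLens sets.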
Some fun
Top 3 most rated movies
1. Iron Man 3 (2013)
2. Man of Steel (2013)
3. World War Z (2013)
Top 3 movies by average rating (min. 20 ratings)
1. The Shawshank Redemption (1994)
2. The Lord of the Rings: The Return of the King (2003)
3. The Dark Knight (2008)
Bottom 3 movies by average rating (min. 20 ratings)
3. Scary MoVie (2013)
2. Piranha 3DD (2012)
1. Cosmopolis (2012)
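Rankings like these can be computed from (item, rating) pairs with a count and an average per item. The sketch below shows one way to do it; the function names are assumptions and the toy ratings are made up, so the threshold is lowered from 20 to 2 to fit them.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def most_rated(ratings: Iterable[Tuple[str, int]], n: int = 3) -> List[Tuple[str, int]]:
    """Items with the largest number of ratings."""
    return Counter(item for item, _ in ratings).most_common(n)


def rank_by_average(
    ratings: Iterable[Tuple[str, int]], min_ratings: int = 20
) -> List[Tuple[str, float, int]]:
    """(item, average, count) sorted from best to worst average rating."""
    totals = defaultdict(lambda: [0, 0])  # item -> [sum of ratings, count]
    for item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    ranked = [
        (item, total / count, count)
        for item, (total, count) in totals.items()
        if count >= min_ratings  # ignore items with too few ratings
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Made-up toy ratings, not taken from the dataset.
toy = [("A", 10), ("A", 8), ("B", 3), ("B", 5), ("B", 4), ("C", 10)]
ranked = rank_by_average(toy, min_ratings=2)
print(most_rated(toy, n=1))
print(ranked[0], ranked[-1])
```

The minimum-count filter is what keeps an item with a single perfect score (item "C" in the toy data) from topping the list.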
Some conclusions
• Outdated public datasets
• Social media = Unstructured data available
• Structured rating data through Twitter - IMDb
• MovieTweetings: our Movie Rating Dataset
• Always up-to-date
• Includes most recent and most relevant movies
• Unfiltered rating data
• Publicly available
• Death Proof (2007) really is an awesome movie

More Related Content

What's hot

自然言語処理で読み解く金融文書
自然言語処理で読み解く金融文書自然言語処理で読み解く金融文書
自然言語処理で読み解く金融文書Takahiro Kubo
 
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)Masakazu Iwamura
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptxSadhanaParameswaran
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleXavier Amatriain
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsJustin Basilico
 
295B_Report_Sentiment_analysis
295B_Report_Sentiment_analysis295B_Report_Sentiment_analysis
295B_Report_Sentiment_analysisZahid Azam
 
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation”
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation” 【機械学習勉強会】画像の翻訳 ”Image-to-Image translation”
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation” yoshitaka373
 
ChatGPTは思ったほど賢くない
ChatGPTは思ったほど賢くないChatGPTは思ったほど賢くない
ChatGPTは思ったほど賢くないCarnot Inc.
 
Security and Privacy of Machine Learning
Security and Privacy of Machine LearningSecurity and Privacy of Machine Learning
Security and Privacy of Machine LearningPriyanka Aash
 
失敗から学ぶ機械学習応用
失敗から学ぶ機械学習応用失敗から学ぶ機械学習応用
失敗から学ぶ機械学習応用Hiroyuki Masuda
 
Cohort Analysis at Scale
Cohort Analysis at ScaleCohort Analysis at Scale
Cohort Analysis at ScaleBlake Irvine
 
kaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detectionkaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact DetectionKazuyuki Miyazawa
 
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019Adversarial Attacks on A.I. Systems — NextCon, Jan 2019
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019anant90
 
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...Anmol Bhasin
 
Pythonではじめるロケーションデータ解析
Pythonではじめるロケーションデータ解析Pythonではじめるロケーションデータ解析
Pythonではじめるロケーションデータ解析Hiroaki Sengoku
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsArun Kejariwal
 

What's hot (20)

自然言語処理で読み解く金融文書
自然言語処理で読み解く金融文書自然言語処理で読み解く金融文書
自然言語処理で読み解く金融文書
 
Data science
Data scienceData science
Data science
 
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)
ディープラーニングを用いた物体認識とその周辺 ~現状と課題~ (Revised on 18 July, 2018)
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora Example
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
295B_Report_Sentiment_analysis
295B_Report_Sentiment_analysis295B_Report_Sentiment_analysis
295B_Report_Sentiment_analysis
 
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation”
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation” 【機械学習勉強会】画像の翻訳 ”Image-to-Image translation”
【機械学習勉強会】画像の翻訳 ”Image-to-Image translation”
 
ChatGPTは思ったほど賢くない
ChatGPTは思ったほど賢くないChatGPTは思ったほど賢くない
ChatGPTは思ったほど賢くない
 
Security and Privacy of Machine Learning
Security and Privacy of Machine LearningSecurity and Privacy of Machine Learning
Security and Privacy of Machine Learning
 
失敗から学ぶ機械学習応用
失敗から学ぶ機械学習応用失敗から学ぶ機械学習応用
失敗から学ぶ機械学習応用
 
Security of Machine Learning
Security of Machine LearningSecurity of Machine Learning
Security of Machine Learning
 
Cohort Analysis at Scale
Cohort Analysis at ScaleCohort Analysis at Scale
Cohort Analysis at Scale
 
kaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detectionkaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detection
 
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019Adversarial Attacks on A.I. Systems — NextCon, Jan 2019
Adversarial Attacks on A.I. Systems — NextCon, Jan 2019
 
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...Tutorial on People Recommendations in Social Networks -  ACM RecSys 2013,Hong...
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013,Hong...
 
Pythonではじめるロケーションデータ解析
Pythonではじめるロケーションデータ解析Pythonではじめるロケーションデータ解析
Pythonではじめるロケーションデータ解析
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 

Viewers also liked

RecSys Challenge 2014 Workshop Introduction
RecSys Challenge 2014 Workshop IntroductionRecSys Challenge 2014 Workshop Introduction
RecSys Challenge 2014 Workshop IntroductionSimon Dooms
 
The MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextThe MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextMax Harper
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systemszhayefei
 
Social Movie Rating
Social Movie Rating Social Movie Rating
Social Movie Rating Xin Li
 
Turrin rec syschallenge_presentation_@recsys2014
Turrin rec syschallenge_presentation_@recsys2014Turrin rec syschallenge_presentation_@recsys2014
Turrin rec syschallenge_presentation_@recsys2014Roberto Turrin
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
Online movie ticket booking
Online movie ticket bookingOnline movie ticket booking
Online movie ticket bookingmrinnovater007
 

Viewers also liked (9)

RecSys Challenge 2014 Workshop Introduction
RecSys Challenge 2014 Workshop IntroductionRecSys Challenge 2014 Workshop Introduction
RecSys Challenge 2014 Workshop Introduction
 
The MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextThe MovieLens Datasets: History and Context
The MovieLens Datasets: History and Context
 
Trust and Recommender Systems
Trust and  Recommender SystemsTrust and  Recommender Systems
Trust and Recommender Systems
 
Social Movie Rating
Social Movie Rating Social Movie Rating
Social Movie Rating
 
Turrin rec syschallenge_presentation_@recsys2014
Turrin rec syschallenge_presentation_@recsys2014Turrin rec syschallenge_presentation_@recsys2014
Turrin rec syschallenge_presentation_@recsys2014
 
B7 ppt
B7 pptB7 ppt
B7 ppt
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Online movie ticket booking
Online movie ticket bookingOnline movie ticket booking
Online movie ticket booking
 

More from Simon Dooms

PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsPhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsSimon Dooms
 
An online evaluation of explicit feedback mechanisms for recommender systems
An online evaluation of explicit feedback mechanisms for recommender systemsAn online evaluation of explicit feedback mechanisms for recommender systems
An online evaluation of explicit feedback mechanisms for recommender systemsSimon Dooms
 
Dynamic generation of personalized hybrid recommender systems
Dynamic generation of personalized hybrid recommender systemsDynamic generation of personalized hybrid recommender systems
Dynamic generation of personalized hybrid recommender systemsSimon Dooms
 
Improving IMDb Movie Recommendations with Interactive Settings and Filters
Improving IMDb Movie Recommendations with Interactive Settings and FiltersImproving IMDb Movie Recommendations with Interactive Settings and Filters
Improving IMDb Movie Recommendations with Interactive Settings and FiltersSimon Dooms
 
Mining Cross-Domain Rating Datasets from Structured Data on Twitter
Mining Cross-Domain Rating Datasets from Structured Data on TwitterMining Cross-Domain Rating Datasets from Structured Data on Twitter
Mining Cross-Domain Rating Datasets from Structured Data on TwitterSimon Dooms
 
Caching strategies for in memory neighborhood-based recommender systems
Caching strategies for in memory neighborhood-based recommender systemsCaching strategies for in memory neighborhood-based recommender systems
Caching strategies for in memory neighborhood-based recommender systemsSimon Dooms
 
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...Simon Dooms
 
A File-Based Approach for Recommender Systems in High-Performance Computing E...
A File-Based Approach for Recommender Systems in High-Performance Computing E...A File-Based Approach for Recommender Systems in High-Performance Computing E...
A File-Based Approach for Recommender Systems in High-Performance Computing E...Simon Dooms
 

More from Simon Dooms (8)

PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsPhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
 
An online evaluation of explicit feedback mechanisms for recommender systems
An online evaluation of explicit feedback mechanisms for recommender systemsAn online evaluation of explicit feedback mechanisms for recommender systems
An online evaluation of explicit feedback mechanisms for recommender systems
 
Dynamic generation of personalized hybrid recommender systems
Dynamic generation of personalized hybrid recommender systemsDynamic generation of personalized hybrid recommender systems
Dynamic generation of personalized hybrid recommender systems
 
Improving IMDb Movie Recommendations with Interactive Settings and Filters
Improving IMDb Movie Recommendations with Interactive Settings and FiltersImproving IMDb Movie Recommendations with Interactive Settings and Filters
Improving IMDb Movie Recommendations with Interactive Settings and Filters
 
Mining Cross-Domain Rating Datasets from Structured Data on Twitter
Mining Cross-Domain Rating Datasets from Structured Data on TwitterMining Cross-Domain Rating Datasets from Structured Data on Twitter
Mining Cross-Domain Rating Datasets from Structured Data on Twitter
 
Caching strategies for in memory neighborhood-based recommender systems
Caching strategies for in memory neighborhood-based recommender systemsCaching strategies for in memory neighborhood-based recommender systems
Caching strategies for in memory neighborhood-based recommender systems
 
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...
A User-centric Evaluation of Recommender Algorithms for an Event Recommendati...
 
A File-Based Approach for Recommender Systems in High-Performance Computing E...
A File-Based Approach for Recommender Systems in High-Performance Computing E...A File-Based Approach for Recommender Systems in High-Performance Computing E...
A File-Based Approach for Recommender Systems in High-Performance Computing E...
 

Recently uploaded

99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdfPaige Cruz
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
The Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API ManagementThe Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API ManagementNuwan Dias
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5DianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimizationarrow10202532yuvraj
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideHironori Washizaki
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Juan Carlos Gonzalez
 

Recently uploaded (20)

99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
The Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API ManagementThe Kubernetes Gateway API and its role in Cloud Native API Management
The Kubernetes Gateway API and its role in Cloud Native API Management
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5UiPath Studio Web workshop series - Day 5
UiPath Studio Web workshop series - Day 5
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?
 

MovieTweetings: a movie rating dataset collected from twitter

  • 1. MovieTweetings: a Movie Rating Dataset Collected From Twitter @sidooms Simon Dooms
  • 2. Research datasets  Recsys research needs datasets  To evaluate, experiment and demonstrate  I need datasets Available for download:  MovieLens 100K  MovieLens 1M  MovieLens 10M ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013 2
  • 3. ConclusionResultsAbout DataTwitter - IMDbIntro 3Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 4. Research datasets  Recsys research needs datasets  To evaluate, experiment and demonstrate  I needed datasets Available for download:  MovieLens 100K ~ most recent movie: 1998  MovieLens 1M ~ most recent movie: 2000  MovieLens 10M ~ most recent movie: 2008 I need up-to-date movie ratings ConclusionResultsAbout DataTwitter - IMDbIntro 4Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 5. Finding data  Data is all around us 5 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 6. 6 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 7. 7 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 8. 8 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 9. Finding data  Data is all around us BUT extremely unstructured  What we want: 1::122::5::838985046 1::185::5::838983525 1::231::5::838983392 1::292::5::838983421 1::316::5::838983392 (user, item, rating, time) 9 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 10. Structured data 10 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 11. Structured data 11 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 13. Structured data 13 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 14. Structured data “I rated Death Proof 10/10 #IMDb” • User • Item (movie) • Rating • Hashtag 14 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 15. Structured data Search Twitter for “I rated #IMDb” Bingo! 15 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 16. Collecting data  We query the Twitter API for “I rated #IMDb”  Extract relevant information  Cross-reference with IMDb for extra genre data 16 ConclusionResultsAbout DataTwitter - IMDb Intro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 17. The data Ratings.dat 1::1074638::7::1365029107 1::1853728::8::1366576639 2::0113277::10::1379466669 Movies.dat 1028528::Death Proof (2007)::Action|Thriller 0133093::The Matrix (1999)::Action|Adventure|Sci-Fi 1670345::Now You See Me (2013)::Thriller Users.dat 1::18405182 2::995885060 3::31260677 IMDb ID - http://www.imdb.com/title/tt0113277 Twitter ID (NOT @handle) Rating scale from 1 to 10 17 ConclusionResultsAbout DataTwitter - IMDbIntro
  • 18. Your data  MovieTweetings dataset available on GitHub (https://github.com/sidooms/MovieTweetings)  Find it on the RecSys Wiki (category datasets) Latest  All ratings  Automagically updated daily Snapshots  Fixed portion of dataset  Added manually when appropriate  10K, 20K, 30K, 40K, 50K, 100K DISCLAIMER: Depending on Twitter API, IMDb apps and me! 18 ConclusionResultsAbout DataTwitter - IMDbIntro Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
  • 19. Some numbers (results on September 30, 2013)
                MovieTweetings   MovieLens 100K   MovieLens 1M   MovieLens 10M
      Ratings          121,404          100,000      1,000,209      10,000,054
      Users             19,464              943          6,040          71,567
      Items             11,655            1,682          3,900          10,681
  • 20. Some fun • Top 3 most rated movies: 1. Iron Man 3 (2013) 2. Man of Steel (2013) 3. World War Z (2013) • Top 3 AVG rated movies (min 20 ratings): 1. The Shawshank Redemption (1994) 2. LOTR: The Return of the King (2003) 3. The Dark Knight (2008) • Bottom 3 worst AVG rated movies (min 20 ratings): 3. Scary MoVie (2013) 2. Piranha 3DD (2012) 1. Cosmopolis (2012)
  • 21. Some conclusions • Outdated public datasets • Social media = Unstructured data available • Structured rating data through Twitter – IMDb • MovieTweetings: our Movie Rating Dataset • Always up-to-date • Includes most recent and most relevant movies • Unfiltered rating data • Publicly available • Death Proof (2007) really is an awesome movie
  • 22. @sidooms Simon Dooms MovieTweetings: a Movie Rating Dataset Collected From Twitter
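
The file format on slide 17 and the counts on slide 19 can be explored with a short Python sketch. It is only an illustration, not code from the dataset's repository: the names Rating, parse_rating, imdb_url and density are hypothetical, the timestamp is assumed to be Unix time in seconds (as in MovieLens), and the IMDb ID is kept as a string so that leading zeros such as in 0113277 are not lost.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int    # internal user ID used in Ratings.dat
    imdb_id: str    # kept as a string so leading zeros survive
    rating: int     # 1 to 10 scale
    timestamp: int  # assumed to be Unix time in seconds


def parse_rating(line: str) -> Rating:
    """Parse one 'user::item::rating::time' line."""
    user, item, rating, ts = line.strip().split("::")
    return Rating(int(user), item, int(rating), int(ts))


def imdb_url(imdb_id: str) -> str:
    """Build the IMDb title URL in the form shown on slide 17."""
    return "http://www.imdb.com/title/tt" + imdb_id


def density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of the user-item matrix that actually holds a rating."""
    return n_ratings / (n_users * n_items)


# Sample rows copied from slide 17.
SAMPLE = [
    "1::1074638::7::1365029107",
    "1::1853728::8::1366576639",
    "2::0113277::10::1379466669",
]

# (ratings, users, items) as reported on slide 19.
COUNTS = {
    "MovieTweetings": (121_404, 19_464, 11_655),
    "MovieLens 100K": (100_000, 943, 1_682),
    "MovieLens 1M": (1_000_209, 6_040, 3_900),
    "MovieLens 10M": (10_000_054, 71_567, 10_681),
}

if __name__ == "__main__":
    for row in SAMPLE:
        parsed = parse_rating(row)
        print(parsed, imdb_url(parsed.imdb_id))
    for name, counts in COUNTS.items():
        print(f"{name}: {density(*counts):.4%} of possible ratings present")
```

With the slide 19 figures, MovieTweetings fills roughly 0.05% of its user-item matrix, against roughly 6.3% for MovieLens 100K, which is the sparsity gap the speaker notes mention.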

Editor's Notes

  1. I am Simon Dooms from Ghent University, Belgium, and I will be presenting the MovieTweetings dataset, a movie rating dataset collected from Twitter.
  2. Elephant in the room: research loves datasets. Recsys research especially needs datasets: we need them to evaluate our algorithms, to experiment, and, when we want to demonstrate our final recommender systems, to drive the engines. I am no different; my research also needed datasets. For my PhD I am working with hybrid recommender systems and I focus on the movie domain because movies are fun. So I needed data to test out new configurations and algorithms and did what we all do … download the MovieLens dataset (which comes in three sizes) and insert it into the system. Experiments went well, evaluations were okay, but then I started visually inspecting the end results (so the recommendation lists) of my system.
  3. This is what I got. I should really watch Braveheart, Forrest Gump and Liar Liar. Three very good suggestions, but they also illustrate a problem: because I use old datasets, I can only recommend old movies. This is not a problem for my personal experiments and offline evaluations. I can calculate all the RMSE I want, but this IS a problem when I want to take my system out of the lab and show it to actual users, maybe run some user-centric experiments.
  4. We should be able to recommend new and interesting movies, but when I inspected the datasets I was working with, I realized that was impossible. When we use the MovieLens 100K dataset, we are in fact working with data that is 15 years old. So the most recent movies we can recommend are ‘Blade’ and ‘Saving Private Ryan’… The bigger MovieLens datasets are somewhat more recent, but still, even 2008 is 5 years ago: the year of the first Twilight movie and the first ‘Iron Man’. So if I want to build a recommender system that produces relevant results, I need up-to-date movie ratings.
  5. So I started to look for rating data. And luckily for me, in these modern times we are living in … data is all around us.
  6. For example, take this IMDb movie page. While we get all kinds of information on the movie, there is also preference information to be found, like the fact that the movie is in a top 5000 list, has a total rating of 7.1, more than 7000 people liked it on Facebook, it had some nominations … and so on.
  7. For another example we go to Facebook, search for the same movie, and this page comes up. Again some basic information about the movie, but also rating information: more than 300 thousand people liked this movie/topic. I can click on this link and I get a new screen listing those 300 thousand users.
  8. Yet another source is Twitter. When I search for tweets containing my movie title, I get lists like this one. All tweets contain the movie title, but in fact only two are actual opinions about the movie. Some are rather neutral or just accidentally happen to contain the movie title, like this second one here.
  9. So data is all around us … But it is extremely unstructured and hard to interpret. What we want is a nice list of users, expressing numerical ratings for items with timestamps. So we restart our quest for data and this time we focus on structured data.
  10. Eventually we found our holy grail in the social share feature integrated in IMDb. You see them everywhere on the web nowadays, the ‘share’ button allowing you to advertise content to your social network. Very often when you click on these things, the original website already makes a suggestion as to what you should write. And luckily for us, IMDb has a very interesting suggestion…
  11. At least it does so for its mobile client apps. They have an app for every major platform, but I have an iPhone, so we will be taking the iPhone tour.
  12. I am on my iPhone and I start the IMDb app… I get this home screen. It allows me to search for movies, so I search for my movie and get this screen… Again, just like on the website, we see some basic information and the option to rate this movie… Now I click the ‘rate this’ link
  13. …and get to the rating screen where I can select my rating. And most importantly, I can choose to share my rating. After saving I get the option to post to Twitter…
  14. …which brings me to the most interesting screenshot. The IMDb app pre-formats my tweet in a structured way: ‘I rated Death Proof 10 out of 10 hashtag #IMDb’. So this tweet actually contains all we need to know: it has a user, item, rating and a hashtag making it easier for us to find the tweets.
  15. Now to find structured ratings, all we need to do is go to Twitter and find all tweets containing ‘I rated’ and the hashtag #IMDb. Et voilà, behold the jackpot of ratings. Now all tweet results are relevant ratings and contain all the information we need to build ourselves an interesting rating dataset.
  16. On a daily basis we query the Twitter API for tweets containing ‘I rated #IMDb’ and we extract the relevant information. We cross-reference this with the IMDb page to also provide some extra genre data, just like MovieLens does.
  17. The end result of our efforts is three files: ratings, movies and users. In the ratings file we have user IDs, item IDs, ratings and timestamps, presented in the MovieLens style to make the dataset compatible with code working on MovieLens data. Note however that the ratings are on a 1 to 10 scale, as is customary for IMDb, and not 1 to 5 as in MovieLens. For the item ID we use the unique IMDb ID, which directs us easily to the relevant IMDb information page by appending it to the IMDb title URL (http://www.imdb.com/title/tt followed by the ID). The movies file contains, again much like the MovieLens dataset, some basic info on the movie like title, year and genres. Then finally the users file: in this file we make the connection between the internal user ID we used in our ratings file and the true Twitter ID of the user. We use the ID and not the username handle because handles can be changed, but the user ID will always remain the same.
  18. I use this dataset for my own research, but I figured it could probably be interesting for the entire recsys community, so I made the dataset available online through the GitHub platform. Information about the dataset is also added to the RecSys wiki, so you can find the dataset in a number of ways. The data itself is made available in two formats, latest and snapshots. The latest repository will always contain all the data and is automagically updated daily. And there are the snapshots, which are just fixed portions of the dataset to make it easier to repeat experiments and refer to the dataset in research. Currently we have snapshots of 10K up to 100K ratings. Little disclaimer I have to add: the continuation of this dataset currently depends on the Twitter API, the functionality of the IMDb apps, and my effort and time. I will do my best to maintain this as long as possible, but there is no way of knowing how long that will be.
  19. Okay, time for some numbers. We started building this dataset 7 months ago and this is how many ratings we have gathered since then. Currently we are adding between 500 and 600 new ratings to the dataset each day, and at that pace we have collected about 120K ratings. If we compare numbers with MovieLens, we can see that our data is much sparser because of the high number of users and items contained in the dataset. Our dataset is unfiltered, so we also have users with fewer than 20 ratings.
  20. Time to wrap up and conclude this presentation. We started with the notion that public datasets are still very often used in research, but they are becoming outdated and fail to incorporate new and relevant items. Lots of data can be found in social media, but it is almost always dubious and unstructured, so hard to use in our systems. We found structured data through the social share features of the IMDb platform and built ourselves a new movie rating dataset based on that. The dataset is updated daily … will therefore always contain the most recent and relevant movies … provides unfiltered rating data … and is publicly available. And last but not least, you should really watch the movie Death Proof, it is awesome. Thank you.
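
The extraction step described in notes 14 to 16 can be sketched in Python as follows. This is an assumption-laden illustration, not the author's collection code: the regular expression only covers the pre-formatted text quoted on slide 14 (‘I rated Death Proof 10/10 #IMDb’), real tweets may carry extra text, and the function name parse_rating_tweet is hypothetical. Querying the Twitter API and looking up the IMDb ID are left out.

```python
import re
from typing import Optional, Tuple

# Matches the pre-formatted share text quoted on slide 14.
# The title is matched lazily so that digits inside a title
# (e.g. "10 Things ...") are not mistaken for the rating.
RATED = re.compile(
    r"I rated (?P<title>.+?) (?P<rating>\d{1,2})/10(?!\d)",
    re.IGNORECASE,
)


def parse_rating_tweet(text: str) -> Optional[Tuple[str, int]]:
    """Return (title, rating) for a structured rating tweet, else None."""
    # The hashtag is what makes these tweets findable, so require it.
    if "#imdb" not in text.lower():
        return None
    match = RATED.search(text)
    if match is None:
        return None
    rating = int(match.group("rating"))
    # IMDb ratings run from 1 to 10; anything else is not a valid rating.
    if not 1 <= rating <= 10:
        return None
    return match.group("title"), rating


if __name__ == "__main__":
    print(parse_rating_tweet("I rated Death Proof 10/10 #IMDb"))
    print(parse_rating_tweet("Death Proof was on TV last night"))
```

The user comes from the tweet's metadata rather than its text, which is why the sketch returns only the title and the rating.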