Identification of Tweets
that Mention Books:
An Experimental Comparison
of Machine Learning Methods
Shuntaro Yada and Kyo Kageura
The University of Tokyo, Japan
ICADL2015, 10th December @South Korea
In this research...
• We examined four machine learning (ML)
methods with different combinations of
features in order to identify Japanese Tweets
that Mention Books (TMBs)
Agenda
1. Introduction
2. Methodology
3. Experiments
4. Conclusion
1. Introduction
Why do we identify TMBs?
• We are developing a book recommendation
system named Serendy
(Shuntaro Yada, Development of a Book Recommendation
System to Inspire “Infrequent Readers”, ICADL2014)
• The TMB identifier is the core technical module
of Serendy
What is Serendy?
Serendy simulates the following situation online,
on Twitter (SNS):
a person discovers books by chance
through daily informal conversation
with friends or acquaintances
Background
The environment of reading or discovering
books is changing
• Personalisation of information consumption
• Increasing popularity of e-Books
➡ We have fewer opportunities to be exposed
to books by chance and/or unconsciously
Possible types of TMBs
TMBs could be tweets…
• Containing bibliographic information explicitly
• Including accurate bibliographic information
• Mentioning books incorrectly/casually
• Mentioning a book implicitly
Target type of TMBs
We focus on TMBs that contain full book titles
• They are likely the majority of explicit TMBs
• We have a comprehensive book title
database
• This task allows us to gain insight into other
TMB types
Task definition (1)
• Identifying tweets containing full book titles can be
defined as a kind of named entity (NE) recognition task
(a comprehensive book title list is available)
• But identifying book titles alone is not enough
• Noise: book titles often consist of ordinary expressions
(e.g., “Kidnapping”, “The Circle”)
• Spam: tweets containing full book titles are often spam
Task definition (2)
Therefore we solve the task of identifying
TMBs as text classification:
➡ Classifying tweets that contain strings which
are the same as book titles (NE) into
TMB/non-TMB (Noise + Spam)
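As a rough illustration of how candidate tweets could be collected before classification, the sketch below keeps tweets that contain exactly one string from a title list. The title list, the example tweets, and the plain substring matching are all assumptions made for illustration; the slides do not describe the matching procedure, and Japanese text would need its own normalisation.

```python
def find_titles(tweet_text, titles):
    """Return every title from `titles` that occurs verbatim in the tweet."""
    return [title for title in titles if title in tweet_text]


def is_candidate(tweet_text, titles):
    """A tweet is a classification candidate if it contains exactly one title string."""
    return len(find_titles(tweet_text, titles)) == 1


# Hypothetical stand-in for a comprehensive book title list.
titles = ["Kidnapping", "The Circle"]

# Hypothetical tweets (English used only for readability).
tweets = [
    "Just finished reading The Circle, highly recommended",
    "We sat in the circle and talked",   # no exact match: case differs
    "News report on a Kidnapping case",  # matches a title string, but is Noise
]

candidates = [t for t in tweets if is_candidate(t, titles)]
```

The third tweet shows why string matching alone is not enough: it passes the title match but does not mention a book, so a classifier still has to separate TMB from non-TMB.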
2. Methodology
Data set
• We gathered 70,844 tweets that each contain exactly
one book title string
• We then selected and manually annotated the tweets
containing book titles that appear fewer than three
times in total: 8,528 tweets

Class    Tweets
TMB         350
Spam      4,040
Noise     4,138
Total     8,528
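The class counts on this slide can be checked with a few lines of arithmetic. The sketch below only uses the three counts reported above; the variable names are illustrative.

```python
# Annotated class counts as reported on the slide.
counts = {"TMB": 350, "Spam": 4040, "Noise": 4138}

total = sum(counts.values())

# Share of each class among all annotated tweets.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The counts sum to the 8,528 annotated tweets, and TMBs make up only about 4% of them, so the data set is heavily imbalanced against the target class.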
Pipeline
Tweets containing a full book title are processed
in the following two steps:
1. Spam filtering (separates Spam from the rest)
2. TMB/Noise classification (separates TMB from Noise)
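The control flow of such a two-step pipeline can be sketched as follows. The two classifier arguments are placeholders for trained models, and the toy rules used here are hypothetical stand-ins, not the classifiers from this research.

```python
def classify_tweet(tweet, is_spam, is_tmb):
    """Two-step pipeline: filter spam first, then separate TMB from Noise."""
    if is_spam(tweet):
        return "Spam"
    return "TMB" if is_tmb(tweet) else "Noise"


# Toy stand-ins for the two trained classifiers (assumptions for illustration).
def toy_is_spam(tweet):
    return "http" in tweet


def toy_is_tmb(tweet):
    return "read" in tweet.lower()


examples = [
    "Buy The Circle now http://shop.example.com",
    "I read The Circle last night",
    "We stood in The Circle of stones",
]
labels = [classify_tweet(t, toy_is_spam, toy_is_tmb) for t in examples]
```

Because the second classifier never sees tweets rejected by the first, each step can use its own algorithm and feature set.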
ML methods
Four candidate algorithms that have been frequently
applied to text classification and have performed well:
• Naïve Bayes
• MaxEnt (Maximum Entropy Modelling)
• SVM (Support Vector Machine)
• Random Forest
Features used for spam filtering
From the tweet text:
• For each URL: URL domain, URL query keys, SSL/TLS usage
• Hashtag count
From the tweet metadata:
• Following and follower counts
• Application name
Example URL:
http://www.amazon.co.jp/gp/product/xxx/ref=as_li_tf_tl (…)
&linkCode=as2&tag=xxx000-22
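A minimal sketch of how the spam-filtering features listed on this slide might be extracted, using only Python's standard library. The function name, the feature naming scheme, the hashtag pattern, and the example tweet are assumptions for illustration; the slides do not specify how the features were encoded.

```python
import re
from urllib.parse import urlparse, parse_qs

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"(?:^|\s)#\w+")


def spam_features(text, following, followers, app_name):
    """Build a sparse feature dict from tweet text and tweet metadata."""
    features = {
        "hashtag_count": len(HASHTAG_PATTERN.findall(text)),
        "following": following,
        "followers": followers,
        "app=" + app_name: 1,
    }
    for url in URL_PATTERN.findall(text):
        parsed = urlparse(url)
        features["domain=" + parsed.netloc] = 1
        # 1 if any URL in the tweet uses HTTPS, otherwise 0.
        uses_tls = int(parsed.scheme == "https")
        features["uses_tls"] = max(features.get("uses_tls", 0), uses_tls)
        # Only the query keys are used, not their values.
        for key in parse_qs(parsed.query):
            features["query_key=" + key] = 1
    return features


# Hypothetical tweet; the URL is a made-up example.
example = spam_features(
    "Buy now https://shop.example.com/item?ref=abc&tag=xyz #books #sale",
    following=1200,
    followers=15,
    app_name="ExampleBot",
)
```

Query keys such as an affiliate tag can signal advertising links even when the domain itself is a legitimate bookshop.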
Features used for TMB/Noise classification
• URL domain (same as spam filtering)
• Application name (same as spam filtering)
• Tokens within tweets, with the following token options:
❖ Part-Of-Speech (POS) tag
❖ NE abstraction
❖ Identifying tokens surrounding the title
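One possible reading of the token options is sketched below: tokens near the title are marked with their position relative to it, and the title itself can be replaced by a placeholder (NE abstraction). This interpretation, the function name, the window size, and the feature labels are assumptions; the input is taken to be already tokenised, which for Japanese would require a morphological analyser.

```python
def title_context_features(tokens, title_start, title_len, window=2,
                           abstract_title=True):
    """Mark tokens surrounding the title; optionally abstract the title itself."""
    before = tokens[max(0, title_start - window):title_start]
    title_end = title_start + title_len
    after = tokens[title_end:title_end + window]

    features = []
    # L1 is the token immediately left of the title, L2 the one before it.
    for offset, token in enumerate(reversed(before), start=1):
        features.append(f"L{offset}={token}")
    # R1 is the token immediately right of the title.
    for offset, token in enumerate(after, start=1):
        features.append(f"R{offset}={token}")

    if abstract_title:
        # NE abstraction: every title maps to the same placeholder feature.
        features.append("TITLE")
    else:
        features.extend("title_tok=" + t for t in tokens[title_start:title_end])
    return features


# Hypothetical pre-tokenised tweet; the title spans tokens 3 and 4.
tokens = ["I", "just", "read", "The", "Circle", "last", "night"]
context = title_context_features(tokens, title_start=3, title_len=2)
```

Abstracting the title keeps the classifier from memorising individual titles, so what it learns is the context in which any title appears.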
3. Experiments
Experimental work
Two phases:
Phase 1: Find the best combination of an ML
algorithm and its settings for each
step of the proposed pipeline
Phase 2: Evaluate our proposed method by
controlled experiments
Phase 1
Find the best classifier and its settings
for each of the two steps of the proposed pipeline
Experiment (a): Grid-search for spam filtering
Experiment (b): Grid-search for TMB/Noise
classification
The best classifier for each step is then used in the pipeline
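A grid search of this kind tries every combination of algorithm and settings and keeps the best-scoring one. The sketch below shows the idea with the standard library only; the grid, the option names, and the scoring function are hypothetical, and the toy scores are not results from this research.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid over algorithms and token options.
grid = {
    "algorithm": ["naive_bayes", "maxent", "svm", "random_forest"],
    "use_pos": [False, True],
    "use_title_context": [False, True],
}


def toy_evaluate(params):
    """Toy scoring function; in practice this would be a cross-validated F-score."""
    score = 0.5
    if params["algorithm"] == "svm":
        score += 0.1
    if params["use_title_context"]:
        score += 0.1
    if params["use_pos"]:
        score -= 0.05
    return score


best_params, best_score = grid_search(grid, toy_evaluate)
```

With 4 algorithms and two binary options, this grid contains 16 combinations; running the search once per pipeline step yields one best classifier for each.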
Phase 1: Experiment (a)
• MaxEnt performed the best for spam
filtering

ML method       F-score
MaxEnt          0.978
Random Forest   0.937
Naïve Bayes     0.772
SVM             0.542
Phase 1: Experiment (b)
• SVM performed the best for TMB/Noise classification
• Tokens surrounding the title were effective

ML method       Token options          F-score
SVM             (Title)                0.699
MaxEnt          (Title)                0.656
Naïve Bayes     (Title, NEabst, POS)   0.577
Random Forest   (Title, NEabst, POS)   0.503
Phase 2
Evaluate the proposed method against a one-step
approach to identifying TMBs
Experiment (c): Two-step pipeline (proposed method)
Experiment (d): Direct classification of TMB
(one-step approach)
Phase 2: Result

Experiment (d)
ML method       Precision  Recall  F-score
Random Forest   0.892      0.343   0.490
SVM             0.331      0.666   0.439
MaxEnt          0.291      0.161   0.389
Naïve Bayes     0.287      0.426   0.342

Experiment (c)
Method             Precision  Recall  F-score
Two-step pipeline  0.698      0.681   0.686
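For reference, the balanced F-score (F1) is the harmonic mean of precision and recall. The sketch below assumes the reported F-score is F1, which the slides do not state. It also does not assume that the reported values were computed directly from the reported precision and recall; they may, for example, be averaged over several runs, so small differences are expected.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the result tables.
two_step = f_score(0.698, 0.681)
random_forest = f_score(0.892, 0.343)

print(f"Two-step pipeline: {two_step:.3f}")
print(f"Random Forest (one-step): {random_forest:.3f}")
```

The harmonic mean penalises imbalance: Random Forest has the highest precision of any method but low recall, so its F-score stays well below that of the two-step pipeline, whose precision and recall are close to each other.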
4. Conclusion
Summary (1)
• We tackled the task of identifying Japanese
tweets that mention books (TMBs)
• Starting from tweets that contain full book
titles based on a book title list, we proposed a
two-step pipeline:
➡ Spam filtering followed by TMB/Noise
classification
Summary (2)
• For spam filtering, MaxEnt performed the
best, while SVM performed the best for
TMB/Noise classification
• The two-step pipeline performed better than
direct (one-step) TMB classifiers
Outlook
• Improve our pipeline by using tweet authors’
profiles
• Handle TMBs containing abbreviated book
titles
Applicability of this research
• This approach might be applicable to other
difficult types of NE recognition similar to
book titles:
• Song
• Film
• TV program…
Thank you for listening
