SpeakerLDA: Discovering Topics
in Transcribed Multi-Speaker
Audio Contents
Damiano Spina, Johanne R. Trippas, Lawrence Cavedon, Mark Sanderson
An Extreme Example: Discussing ‘Merengue’ (Spanish)
What is the dialogue about?
Not considering speakers: {dance, egg, whip, Terpsichore, Latin, America, white, dessert}
vs.
Considering speakers: {dance, Terpsichore, Latin, America} and {dessert, whip, white, egg}
Hypothesis
Considering information about speakers—which words/fragments
correspond to each speaker—would improve topic discovery
Example: Topic Discovery for Recommendation
More Like This:
{dance, Terpsichore, Latin, America}: more content about dance
{dessert, whip, white, egg}: more content about desserts
Topic Discovery in Multi-Speaker
Audio Contents: Applications
• Multi-Speaker Audio Contents:
  • Podcasts (news, shows, interviews, etc.)
  • Meetings
  • TV programs
• Applications:
  • Content-based Recommendation: ‘more like this’
  • Clustering
    • Group search results according to topics
    • E.g., Search Result Presentation
Research Question
What is the impact in terms of effectiveness of
adding speaker information to a topic model
when compared to traditional approaches (i.e., LDA)?
Topic Discovery
[Image from Blei, D. Probabilistic Topic Models, Communications of the ACM, 2012]
Distribution of topics over words
Distribution of topics over documents
Topic Discovery vs. Topic Segmentation
Topic Discovery
• Characterizes documents according to topics
• 1 document ~ distribution of topics
• Illustration: t1 t2 t3
Topic Segmentation
• Characterizes how a conversation evolves over time in terms of topics
• 1 document ~ sequence of topics
• Illustration (over time): t1 t3 t2 t3 t2 t1
Topic Discovery vs. Topic Segmentation
• Not using speaker information
  • Topic Discovery: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
  • Topic Segmentation: TextTiling [Hearst, 1997], [Purver et al. 2006]
• Using speaker information
  • Topic Discovery: ?
  • Topic Segmentation: SITS [Nguyen et al., 2012]
Topic Discovery vs. Topic Segmentation
• Not using speaker information
  • Topic Discovery: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]
  • Topic Segmentation: TextTiling [Hearst, 1997], [Purver et al. 2006]
• Using speaker information
  • Topic Discovery: SpeakerLDA
  • Topic Segmentation: SITS [Nguyen et al., 2012]
• Comparisons marked on the slide: RQ, RQ'
Proposed Approach: SpeakerLDA
• Split documents (D) according to speakers (S)
• Run LDA
• Combine the topic distributions θ_ds obtained for each speaker’s pseudo-document ds
Proposed Approach: SpeakerLDA
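The three steps (split, run LDA, combine) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the slides do not say how the per-speaker distributions are combined, so the sketch assumes an average weighted by pseudo-document length. The LDA step is a tiny collapsed Gibbs sampler written for the example, and the function names and the toy dialogue are hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(document):
    """Group (speaker, utterance) turns into one token list per speaker."""
    pseudo_docs = defaultdict(list)
    for speaker, utterance in document:
        pseudo_docs[speaker].extend(utterance.lower().split())
    return dict(pseudo_docs)


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns one topic distribution per doc."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]  # remove the token's current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z  # resample and restore the counts
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def speaker_lda(documents, num_topics, **lda_kwargs):
    """Split documents by speaker, run LDA on pseudo-documents, merge per document."""
    pseudo_docs, owners = [], []
    for doc_id, document in enumerate(documents):
        for tokens in split_by_speaker(document).values():
            pseudo_docs.append(tokens)
            owners.append(doc_id)
    thetas = lda_gibbs(pseudo_docs, num_topics, **lda_kwargs)
    combined = [[0.0] * num_topics for _ in documents]
    lengths = [0] * len(documents)
    for tokens, owner, theta in zip(pseudo_docs, owners, thetas):
        lengths[owner] += len(tokens)
        for k, p in enumerate(theta):
            # Assumption: weight each speaker by pseudo-document length.
            combined[owner][k] += len(tokens) * p
    return [
        [v / n if n else 1.0 / num_topics for v in row]
        for row, n in zip(combined, lengths)
    ]


# Hypothetical two-speaker dialogue inspired by the 'Merengue' example.
dialogue = [
    ("A", "merengue dance latin america dance"),
    ("B", "merengue dessert egg white whip"),
    ("A", "dance terpsichore latin"),
    ("B", "whip egg white dessert"),
]
theta = speaker_lda([dialogue], num_topics=2)[0]
print([round(p, 3) for p in theta])
```

Length weighting keeps a speaker who says very little from dominating the document-level distribution. A uniform average over speakers would be an equally plausible choice.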
Evaluation Framework
• Topic models are typically evaluated by
(i) computing intrinsic metrics (e.g., perplexity) of the model on an unseen set of documents, or
(ii) being applied to external information access tasks (e.g., topic detection as a clustering task)
• Needs manually annotated ground truth
• One possible measure: Precision/Recall of clustering relationships
Evaluation Framework II
• Is there any test collection suitable for measuring differences between our approach and existing topic models?
• It must satisfy the following conditions:
A. Each topic is discussed in two or more documents
B. It includes spoken documents with two or more speakers
The AMI Corpus satisfies both conditions!
The AMI Corpus
• Augmented Multi-Party Interaction (AMI) Corpus
• 100 hours of recorded audio
• More than 100 meetings with multiple speakers (generally 4)
• Real and elicited scenario-driven meetings
• Speakers play different roles:
• Interface designer, project manager, industrial designer, marketing
• Manual transcriptions, including speaker segmentation
• Transcripts segmented according to topics and subtopics
Generating a Gold Standard for Topic
Discovery
Work in Progress
• Compare the effectiveness of SpeakerLDA vs. LDA (and vs. topic
segmentation approaches)
• Extrinsic Evaluation: compare system outputs to clustering gold
standard
[Figure: Reliability (BCubed Precision) plotted against Sensitivity (BCubed Recall) for two systems, LDA and SpeakerLDA]
• AMI Corpus
• Topic Segmentation annotations as clustering gold standard
• Varying initial number of topics
• Considering the n most frequent topics in the topic-document distribution for topic assignment
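As a rough illustration of the extrinsic evaluation, the sketch below computes BCubed Precision and Recall for a hard clustering. It covers only the simplest case, n = 1, where each document is assigned to its single most probable topic. Handling n > 1 would need the extended BCubed variant for overlapping clusters, which is not shown. The meeting identifiers, labels and probabilities are made up for the example.

```python
from collections import defaultdict


def assign_top_topic(theta_by_doc):
    """Hard-assign each document to its most probable topic (the n = 1 case)."""
    return {
        doc: max(range(len(theta)), key=theta.__getitem__)
        for doc, theta in theta_by_doc.items()
    }


def bcubed(system, gold):
    """Return (precision, recall) of BCubed for two hard clusterings.

    `system` and `gold` map every item to exactly one cluster label.
    """
    items = list(gold)
    sys_clusters, gold_clusters = defaultdict(set), defaultdict(set)
    for item in items:
        sys_clusters[system[item]].add(item)
        gold_clusters[gold[item]].add(item)
    precision = recall = 0.0
    for item in items:
        same_sys = sys_clusters[system[item]]
        same_gold = gold_clusters[gold[item]]
        correct = len(same_sys & same_gold)  # items sharing both clusters
        precision += correct / len(same_sys)
        recall += correct / len(same_gold)
    return precision / len(items), recall / len(items)


# Hypothetical topic-document distributions and gold labels.
thetas = {"m1": [0.8, 0.2], "m2": [0.7, 0.3], "m3": [0.6, 0.4], "m4": [0.1, 0.9]}
gold = {"m1": "design", "m2": "design", "m3": "budget", "m4": "budget"}

system = assign_top_topic(thetas)
precision, recall = bcubed(system, gold)
print(round(precision, 3), round(recall, 3))  # prints: 0.667 0.75
```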
Work in Progress
• Challenge: How to define a valid clustering gold standard from topic segmentation annotations?
• Opportunity: Compare system output to topic distribution gold standard.
  • Generate distributions from annotated segments
Gold topic distribution for the meeting IS1008c:
{closing=0.09, opening=0.03, components...=0.21, discussion=0.06, industrial...=0.21, interface…=0.21, marketing...=0.20}
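One way to read "generate distributions from annotated segments" is to sum the length of the segments annotated with each topic and normalize. The sketch below does that, then compares two distributions with the Jensen-Shannon divergence. Both choices are assumptions made for illustration: the slides do not state the length unit (words, seconds) or the comparison measure. The sketch also assumes the system's latent topics have already been mapped onto the gold labels, which is a separate problem. The segment data is hypothetical.

```python
from collections import Counter
from math import log2


def gold_distribution(segments):
    """Turn (topic_label, length) annotated segments into a topic distribution."""
    totals = Counter()
    for label, length in segments:
        totals[label] += length
    whole = sum(totals.values())
    if whole == 0:
        raise ValueError("segments have no length to normalize")
    return {label: value / whole for label, value in totals.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two label->prob dicts."""
    labels = set(p) | set(q)
    mix = {l: 0.5 * (p.get(l, 0.0) + q.get(l, 0.0)) for l in labels}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture; zero terms skipped.
        return sum(
            a[l] * log2(a[l] / mix[l]) for l in labels if a.get(l, 0.0) > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical annotated segments: (topic label, length in words).
segments = [("opening", 30), ("discussion", 120), ("closing", 50), ("discussion", 200)]
gold = gold_distribution(segments)
system = {"opening": 0.10, "discussion": 0.70, "closing": 0.20}  # assumed mapped output

print(gold)
print(jensen_shannon(gold, system))
```

A divergence of 0 means the two distributions are identical, and 1 means they share no topic mass.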
Conclusions
• We propose SpeakerLDA, a topic model that takes into account
speaker information to discover what a set of audio documents (such
as podcasts) is about
• It can be used for clustering search results or content-based
recommendation (‘more like this’)
• We are currently investigating how to generate a clustering gold
standard from topic segmentation annotations in the AMI Corpus
• Evaluate topic models by comparing against a topic distribution gold
standard?
Thank you!
- For dessert we have...'Merengue'!
@damiano10
damiano.spina@rmit.edu.au

VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ SLAM 2015

  • 1. SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents Damiano Spina, Johanne R. Trippas, Lawrence Cavedon, Mark Sanderson
  • 2. An Extreme Example: Discussing ‘Merengue’ (Spanish). What is the dialogue about? Not considering speakers: {dance, egg, whip, Terpsichore, Latin, America, white, dessert} vs. considering speakers: {dance, Terpsichore, Latin, America} and {dessert, whip, white, egg}
  • 3. Hypothesis Considering information about speakers—which words/fragments correspond to each speaker—would improve topic discovery
  • 4. Example: Topic Discovery for Recommendation (‘More Like This’) • {dance, Terpsichore, Latin, America}: more content about dance • {dessert, whip, white, egg}: more content about desserts
  • 5. Topic Discovery in Multi-Speaker Audio Contents: Applications • Multi-Speaker Audio Contents: • Podcasts (news, shows, interviews, etc.) • Meetings • TV programs • Applications: • Content-based Recommendation: ‘more like this’ • Clustering • Group search results according to topics • E.g., Search Result Presentation
  • 6. Research Question What is the impact in terms of effectiveness of adding speaker information to a topic model when compared to traditional approaches (i.e., LDA)?
  • 7. Topic Discovery [Image from Blei, D. Probabilistic Topic Models, Communications of the ACM, 2012] Distribution of topics over words Distribution of topics over documents
  • 8. Topic Discovery vs. Topic Segmentation • Topic Discovery: characterizes documents according to topics; 1 document ~ distribution of topics (t1, t2, t3) • Topic Segmentation: characterizes how a conversation evolves over time in terms of topics; 1 document ~ sequence of topics (t1 t3 t2 t3 t2 t1 over time)
  • 9. Topic Discovery vs. Topic Segmentation • Not using speaker information: Topic Discovery: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]; Topic Segmentation: TextTiling [Hearst, 1997], [Purver et al. 2006] • Using speaker information: Topic Discovery: ?; Topic Segmentation: SITS [Nguyen et al., 2012]
  • 10. Topic Discovery vs. Topic Segmentation • Not using speaker information: Topic Discovery: Latent Dirichlet Allocation (LDA) [Blei et al., 2003]; Topic Segmentation: TextTiling [Hearst, 1997], [Purver et al. 2006] • Using speaker information: Topic Discovery: SpeakerLDA; Topic Segmentation: SITS [Nguyen et al., 2012] • Diagram labels: RQ, RQ'
  • 11. Proposed Approach: SpeakerLDA • Split documents (D) according to speakers (S) • Run LDA • Combine topic distributions obtained for each speaker’s pseudo-document d_s (θ_ds)
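The three steps on slide 11 can be illustrated with a minimal Python sketch. The slides do not say how the per-speaker topic distributions are combined, so the length-weighted average used here is an assumption, as are the function names `split_by_speaker` and `combine_topic_distributions`. The LDA step itself is left out: the per-speaker distributions in the example are made-up placeholders for what a fitted model might return.

```python
from collections import defaultdict


def split_by_speaker(doc_id, utterances):
    """Step 1: group a document's utterances into per-speaker pseudo-documents.

    `utterances` is a list of (speaker, text) pairs.
    Returns {(doc_id, speaker): [tokens]}.
    """
    pseudo_docs = defaultdict(list)
    for speaker, text in utterances:
        pseudo_docs[(doc_id, speaker)].extend(text.lower().split())
    return dict(pseudo_docs)


def combine_topic_distributions(thetas, lengths):
    """Step 3 (assumed rule): length-weighted average of the
    pseudo-document topic distributions of one original document."""
    total = sum(lengths[key] for key in thetas)
    num_topics = len(next(iter(thetas.values())))
    combined = [0.0] * num_topics
    for key, theta in thetas.items():
        weight = lengths[key] / total
        for k, p in enumerate(theta):
            combined[k] += weight * p
    return combined


# Toy dialogue inspired by the 'Merengue' example (hypothetical utterances).
meeting = [
    ("A", "merengue is a Latin America dance"),
    ("B", "whip the egg white for the dessert"),
    ("A", "Terpsichore dance"),
]
pseudo_docs = split_by_speaker("d1", meeting)
lengths = {key: len(tokens) for key, tokens in pseudo_docs.items()}

# Step 2 (running LDA on the pseudo-documents) is not shown. These values
# are illustrative placeholders for a fitted 2-topic model's output.
thetas = {("d1", "A"): [0.9, 0.1], ("d1", "B"): [0.2, 0.8]}
theta_d = combine_topic_distributions(thetas, lengths)
print(theta_d)
```

Weighting by pseudo-document length keeps a speaker who says very little from dominating the document-level distribution; a plain average or a maximum over speakers would be equally easy to swap in.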
  • 13. Evaluation Framework • Topic models are typically evaluated by (i) computing intrinsic metrics (e.g., perplexity) of the model on an unseen set of documents or (ii) being applied to external information access tasks (e.g., topic detection as a clustering task) • Needs manually annotated ground truth • One possible measure: Precision/Recall of clustering relationships
  • 14. Evaluation Framework II • Is there any test collection suitable for measuring differences between our approach and existing topic models? • It must satisfy the following conditions: A. Each topic is discussed in two or more documents B. It includes spoken documents with two or more speakers • The AMI Corpus satisfies both conditions!
  • 15. The AMI Corpus • Augmented Multi-Party Interaction (AMI) Corpus • 100 hours of recorded audio • More than 100 meetings with multiple speakers (generally 4) • Real and elicited scenario-driven meetings • Speakers play different roles: • Interface designer, project manager, industrial designer, marketing • Manual transcriptions, including speaker segmentation • Transcripts segmented according to topics and subtopics
  • 17. Generating a Gold Standard for Topic Discovery
  • 18. Work in Progress • Compare the effectiveness of SpeakerLDA vs. LDA (and vs. topic segmentation approaches) • Extrinsic Evaluation: compare system outputs to clustering gold standard
  • 19. [Plot: Reliability (BCubed Precision) vs. Sensitivity (BCubed Recall) for LDA and SpeakerLDA] • AMI Corpus • Topic Segmentation annotations as clustering gold standard • Varying initial number of topics • Considering the n most frequent topics in the topic-document distribution for topic assignment
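For readers unfamiliar with the two axes of the plot on slide 19, the following sketch computes BCubed precision and recall for a clustering in which each document has exactly one system cluster and one gold category. That single-label setting is a simplifying assumption: assigning the n most frequent topics per document, as the slide describes, can produce overlapping clusters, which would need the extended BCubed definitions. The meeting IDs and labels are invented for illustration.

```python
def bcubed(system, gold):
    """Return (precision, recall) averaged over all items.

    `system` maps item -> system cluster, `gold` maps item -> gold category.
    """
    items = list(gold)
    precision = recall = 0.0
    for i in items:
        same_cluster = [j for j in items if system[j] == system[i]]
        same_category = [j for j in items if gold[j] == gold[i]]
        # Items that share both the cluster and the category with item i.
        correct = sum(1 for j in same_cluster if gold[j] == gold[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_category)
    n = len(items)
    return precision / n, recall / n


# Hypothetical example: four meetings, two gold topics, two system clusters.
gold = {"m1": "design", "m2": "design", "m3": "marketing", "m4": "marketing"}
system = {"m1": 0, "m2": 0, "m3": 0, "m4": 1}
p, r = bcubed(system, gold)
print(p, r)
```

Here cluster 0 mixes two design meetings with one marketing meeting, which lowers precision, while splitting the marketing meetings across two clusters lowers recall.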
  • 20. Work in Progress • Compare the effectiveness of SpeakerLDA vs. LDA (and vs. topic segmentation approaches) • Extrinsic Evaluation: compare system outputs to clustering gold standard • Challenge: How to define a valid clustering gold standard from topic segmentation annotations? • Opportunity: Compare system output to topic distribution gold standard. • Generate distributions from annotated segments
  • 21. Gold topic distribution for the meeting IS1008c: {closing=0.09, opening=0.03, components...=0.21, discussion=0.06, industrial...=0.21, interface…=0.21, marketing...=0.20}
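The slides do not state how a gold topic distribution such as the one above is derived from the annotated segments. One plausible reading, sketched below purely as an assumption, is to take each topic's share of the words in a meeting. The function name `gold_topic_distribution`, the segment labels, and the word counts are all hypothetical and are not taken from the AMI Corpus.

```python
from collections import Counter


def gold_topic_distribution(segments):
    """Turn annotated segments into a topic distribution.

    `segments` is a list of (topic_label, num_words) pairs in meeting order.
    Assumption: a topic's probability is its share of the meeting's words.
    """
    totals = Counter()
    for label, num_words in segments:
        totals[label] += num_words
    total_words = sum(totals.values())
    return {label: count / total_words for label, count in totals.items()}


# Hypothetical segmentation of one meeting (labels and counts are invented).
segments = [
    ("opening", 50),
    ("discussion", 100),
    ("components", 250),
    ("discussion", 50),
    ("closing", 50),
]
dist = gold_topic_distribution(segments)
print(dist)
```

Segment duration could replace word counts as the weight without changing the structure of the function.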
  • 22. Conclusions • We propose SpeakerLDA, a topic model that takes into account speaker information to discover what a set of audio documents (such as podcasts) is about • It can be used for clustering search results or content-based recommendation (‘more like this’) • We are currently investigating how to generate a clustering gold standard from topic segmentation annotations in the AMI Corpus • Evaluate topic models by comparing against a topic distribution gold standard?
  • 23. Thank you! - For dessert we have...'Merengue'!
  • 24. SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents Damiano Spina, Johanne R. Trippas, Lawrence Cavedon, Mark Sanderson @damiano10 damiano.spina@rmit.edu.au