Detecting and Describing Historical 
Periods in a Large Corpora 
Tiberiu Popa, Traian Rebedea, Costin Chiru 
University Politehnica of Bucharest 
Faculty of Automatic Control and Computers
Outline 
• Context 
• Architecture 
• Historical Features Detection 
• Topic Modeling for Historically Relevant 
Documents 
• Results 
• Future Work & Conclusions 
11 Nov 2014 
26th IEEE International Conference on Tools 
with Artificial Intelligence, ICTAI 2014 
Context 
• Many historic events are remembered by slogans, 
expressions or words that are strongly linked to them 
• Try to establish the correlation between significant 
historic events and the texts written in that period 
• The analysis should be based on a representative set 
of texts written throughout history 
Context 
• The outcome should contain: 
– A partition of the years into groups such that the years in 
each group are related to a specific event 
– A short description for each group of years 
• Examples: 
– 1858–1864, 1867–1868: rebel, confederate, secession, 
vicksburg, chattanooga 
– 1969–1981: pollution, nixon, slavery, blacks, 
urbanization 
Google Books Ngrams 
• Corpus that contains statistics extracted from 
over 5 million books, or about 4% of all books 
ever published (in English) 
• Due to copyright restrictions, only frequency 
statistics are provided for each word 
• Frequencies ranging from unigrams to 5-grams 
• Books from 1500 to 2008 (nowadays) 
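A minimal sketch of how per-year frequency rows could be turned into one time series per n-gram. The tab-separated row layout (n-gram, year, match count, volume count) is an assumption about the downloaded files, the function name `build_time_series` is hypothetical, and the sample counts are made up for illustration.

```python
from collections import defaultdict

def build_time_series(lines, first_year=1500, last_year=2008):
    """Return {ngram: [match count for each year in first_year..last_year]}."""
    n_years = last_year - first_year + 1
    series = defaultdict(lambda: [0] * n_years)
    for line in lines:
        # Assumed row layout: ngram, year, match_count, volume_count.
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        year = int(year)
        if first_year <= year <= last_year:
            series[ngram][year - first_year] += int(match_count)
    return dict(series)

# Toy rows with invented counts, only to show the shape of the result.
rows = [
    "secession\t1860\t120\t40",
    "secession\t1861\t310\t95",
    "nixon\t1972\t500\t200",
]
ts = build_time_series(rows)
print(ts["secession"][1860 - 1500], ts["secession"][1861 - 1500])
```

Years without a row stay at zero, so every n-gram ends up with a series of the same length.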
Google Books Ngrams 
• For each word, the associated time series is 
denoted by 
Related Work 
• Culturomics – quantitative analysis of culture 
– Computational investigation of cultural trends (e.g. using 
Google Books, or other corpora over a large period of time) 
– “can provide insights about fields as diverse as lexicography, the 
evolution of grammar, collective memory, the adoption of 
technology, the pursuit of fame, censorship, and historical 
epidemiology” 
• Semantic evolution of words over time 
– Topics over time 
– Time influences the meaning of a word => change of 
topics/meanings over time 
• Evolution of the topics in a specific research field (e.g. 
computational linguistics) over time using topic models 
– Showed the rise of probabilistic models in NLP 
Architecture 
Google Books N-grams → Historically Relevant Documents → Relevant Historical Topics 
Detection of Historical Features 
• A special case of bursty feature detection 
• Detects periods of increased activity in the time series 
• For each n-gram, it must also assign a “bursty” weight to each 
year (an integer between 0 and 10) 
Double Change 
• Peaks usually consist of a period of abrupt increase, followed 
by another period of abrupt decrease 
• Compute the relative change from one year to another 
• The bursty weight rt depends on 
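The relative-change step on this slide can be sketched as follows. The slides do not give the weight formula, so the code stops at finding an abrupt increase followed by an abrupt decrease. The function names, `threshold` and `max_gap` are hypothetical choices for illustration.

```python
def relative_changes(series):
    """Relative change from each year to the next; None where the previous value is 0."""
    changes = []
    for prev, cur in zip(series, series[1:]):
        changes.append((cur - prev) / prev if prev > 0 else None)
    return changes

def rise_then_fall(series, threshold=0.5, max_gap=5):
    """Return (rise_index, fall_index) pairs: an abrupt increase that is
    followed within max_gap steps by an abrupt decrease.
    threshold and max_gap are illustrative values, not taken from the talk."""
    changes = relative_changes(series)
    pairs = []
    for i, up in enumerate(changes):
        if up is None or up < threshold:
            continue
        for j in range(i + 1, min(i + 1 + max_gap, len(changes))):
            down = changes[j]
            if down is not None and down <= -threshold:
                # changes[k] describes the step into position k + 1.
                pairs.append((i + 1, j + 1))
                break
    return pairs

print(rise_then_fall([10, 10, 30, 32, 12, 11]))
```

Each pair marks the position where the series jumps up and the later position where it drops back, which is the rise-then-fall shape described above.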
Linear Model 
• Approximate the frequency time series by a piecewise linear 
function 
– Fit lines to the graph of the time series by considering larger and larger 
intervals until the error rises above a given threshold 
• The bursty weight rt depends on the logarithm of the slope 
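A minimal sketch of the greedy fitting described above: grow an interval one year at a time, refit a least-squares line, and close the segment once the error exceeds a threshold. The error measure (maximum absolute residual), the `max_error` value and the function names are assumptions. The slides do not say which error or threshold was used.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept, max_abs_residual)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    err = max(abs(y - (slope * x + intercept)) for x, y in enumerate(ys))
    return slope, intercept, err

def piecewise_linear(series, max_error=1.0):
    """Greedy segmentation; returns a list of (start, end, slope) tuples."""
    segments = []
    start, n = 0, len(series)
    while start < n - 1:
        end = start + 1  # two points are always fitted exactly
        slope, _, _ = fit_line(series[start:end + 1])
        while end + 1 < n:
            cand_slope, _, err = fit_line(series[start:end + 2])
            if err > max_error:
                break  # extending further would exceed the error threshold
            end += 1
            slope = cand_slope
        segments.append((start, end, slope))
        start = end  # the next segment starts where this one ends
    return segments

print(piecewise_linear([0, 2, 4, 6, 4, 2, 0]))
```

The slope of each segment is the quantity a weight based on the logarithm of the slope would be computed from.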
Gaussian Model 
• Peaks are usually bell-shaped, so try to fit a Gaussian 
distribution 
• First, normalize the time series to get a probability 
distribution 
• Then, try to approximate it with a normal distribution 
Gaussian Model 
• Last, use the earth mover’s distance (EMD) to compute the 
similarity between the normalized time series and the fitted Gaussian 
• Select non-overlapping intervals that have an EMD lower than 
0.3 in a greedy fashion from left to right 
• The bursty weight rt depends on the change of the fitted 
Gaussian (max vs. min value) for each discrete interval 
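The three Gaussian-model steps (normalize, fit, compare with EMD) can be sketched for a single interval as below. Fitting by matching mean and variance, and computing the one-dimensional EMD as the sum of absolute differences between cumulative distributions on a unit-spaced grid, are assumptions. The slides do not state how the fit or the distance were computed, so the 0.3 threshold may not carry over to this scaling.

```python
import math

def normalize(series):
    """Scale a non-negative series so that it sums to 1."""
    total = sum(series)
    return [v / total for v in series]

def fit_gaussian(probs):
    """Moment-matched Gaussian, discretized on the same grid and renormalized."""
    mean = sum(i * p for i, p in enumerate(probs))
    var = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    if var == 0:
        # Degenerate case: all mass on a single year.
        return [1.0 if i == round(mean) else 0.0 for i in range(len(probs))]
    dens = [math.exp(-((i - mean) ** 2) / (2 * var)) for i in range(len(probs))]
    z = sum(dens)
    return [d / z for d in dens]

def emd_1d(p, q):
    """EMD between two distributions on the same unit-spaced grid."""
    total, cum = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b       # running difference of the two CDFs
        total += abs(cum)
    return total

p = normalize([1, 4, 6, 4, 1])   # toy bell-shaped peak
q = fit_gaussian(p)
print(emd_1d(p, q))
```

A small distance means the interval is close to bell-shaped. Running this over candidate intervals and keeping the non-overlapping ones below the threshold would give the greedy selection described above.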
Detection of Historical Features - 
Comparison 
• Difficult to measure which of these three methods performs 
best at detecting and characterizing historically relevant peaks 
• Need a dataset created with the help of historians 
Historically Relevant Documents 
• Each year is viewed as a document 
• The weight of a term in a specific year is given by rt 
– For all terms that have rt > 0 
• Try to cluster these documents and summarize each cluster 
• Use LDA (Latent Dirichlet Allocation) to extract topics 
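A minimal sketch of the year-as-document representation, assuming the bursty weights are available as a nested mapping from term to year to weight. The names `build_year_documents` and `to_matrix` and the toy weights are hypothetical. The LDA step itself is left out because the slides do not say which implementation was used.

```python
def build_year_documents(weights):
    """{term: {year: r_t}} -> {year: {term: r_t}}, keeping only r_t > 0."""
    docs = {}
    for term, by_year in weights.items():
        for year, r in by_year.items():
            if r > 0:
                docs.setdefault(year, {})[term] = r
    return docs

def to_matrix(docs):
    """Return (years, vocabulary, matrix) with one row per year-document."""
    years = sorted(docs)
    vocab = sorted({term for doc in docs.values() for term in doc})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in years]
    for i, year in enumerate(years):
        for term, r in docs[year].items():
            matrix[i][column[term]] = r
    return years, vocab, matrix

# Toy weights, invented for illustration only.
weights = {"rebel": {1861: 8, 1862: 10, 1900: 0}, "nixon": {1972: 9}}
years, vocab, matrix = to_matrix(build_year_documents(weights))
print(years, vocab, matrix)
```

Because the weights are integers, each row looks like a bag-of-words count vector, the kind of input count-based topic models usually expect.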
Results 
• Topic modeling (e.g. LDA) 
allows each document to 
capture a mixture of topics 
• The analysis of the topics 
shows that most years have a 
predominant topic (over 50% 
in the corresponding mixture) 
• The table contains a 
post-processed version of the 
topics for the last century 
• Manually removed the noisy 
words that appeared in the top 
10 words for each topic 
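One way to turn per-year topic mixtures into groups of years is to keep each year's predominant topic (share above 50%) and merge consecutive years into ranges. This is a sketch under that assumption. The function names and the toy mixtures are hypothetical and do not reproduce the talk's results.

```python
def dominant_topic(mixture, min_share=0.5):
    """Index of the topic whose share exceeds min_share, or None."""
    best = max(range(len(mixture)), key=lambda k: mixture[k])
    return best if mixture[best] > min_share else None

def group_years(year_mixtures, min_share=0.5):
    """{year: mixture} -> {topic: [(first_year, last_year), ...]}"""
    groups = {}
    for year in sorted(year_mixtures):
        topic = dominant_topic(year_mixtures[year], min_share)
        if topic is None:
            continue  # no predominant topic for this year
        ranges = groups.setdefault(topic, [])
        if ranges and ranges[-1][1] == year - 1:
            ranges[-1] = (ranges[-1][0], year)   # extend the current range
        else:
            ranges.append((year, year))          # start a new range
    return groups

# Toy mixtures over two topics, invented for illustration only.
mixtures = {
    1858: [0.7, 0.3],
    1859: [0.6, 0.4],
    1860: [0.2, 0.8],
    1861: [0.9, 0.1],
    1862: [0.5, 0.5],
}
print(group_years(mixtures))
```

A topic can own several disjoint ranges, which matches the form of the examples given earlier, such as 1858–1864 together with 1867–1868.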
Results – American Civil War 
• Topic for the American Civil War (1858-1864, 
1867-1868) 
• Double change bursty feature detection 
Results – WWI 
• Topic for World War I and the peace treaty 
(1916-1920) 
• Gaussian model bursty feature detection 
Results – pre-WWII 
• Topic for the period before World War II 
(1932-1936) 
• Linear model peak detection 
Future Work 
• Exploring alternatives 
– Computing the historical relevance of a word has a lot 
of potential for improvement, both in finding new 
definitions and in finding ways to combine the existing 
ones 
– Are topic models really the key to understanding 
historically relevant documents? 
• Improve the validation 
– Build a corpus, with the help of historians and 
linguists, that contains a set of “historically relevant” 
peaks and periods 
Conclusions 
• Theoretical framework for identifying historic periods 
and events 
• Linking these periods with words and LDA topics 
extracted from large corpora of texts 
• Important concept: historical relevance of a word 
• Several methods for computing the historically 
relevant features 
Questions? 
Discussion 
This work has been funded by the 
Sectorial Operational Programme 
Human Resources Development 
2007-2013 of the Romanian Ministry of 
European Funds through the Financial 
Agreement POSDRU/159/1.5/S/132397 

More Related Content

What's hot

Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and ChallengesJens Lehmann
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
 
SelQA: A New Benchmark for Selection-based Question Answering
SelQA: A New Benchmark for Selection-based Question AnsweringSelQA: A New Benchmark for Selection-based Question Answering
SelQA: A New Benchmark for Selection-based Question AnsweringJinho Choi
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingNa'im Tyson
 
Practical machine learning - Part 1
Practical machine learning - Part 1Practical machine learning - Part 1
Practical machine learning - Part 1Traian Rebedea
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Lviv Data Science Summer School
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational SemanticsMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 

What's hot (20)

Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
 
Question answering
Question answeringQuestion answering
Question answering
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
SelQA: A New Benchmark for Selection-based Question Answering
SelQA: A New Benchmark for Selection-based Question AnsweringSelQA: A New Benchmark for Selection-based Question Answering
SelQA: A New Benchmark for Selection-based Question Answering
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
 
Practical machine learning - Part 1
Practical machine learning - Part 1Practical machine learning - Part 1
Practical machine learning - Part 1
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Language models
Language modelsLanguage models
Language models
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational Semantics
 
AINL 2016: Yagunova
AINL 2016: YagunovaAINL 2016: Yagunova
AINL 2016: Yagunova
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 

Viewers also liked

Importanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuriImportanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuriTraian Rebedea
 
Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Traian Rebedea
 
Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Traian Rebedea
 
Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2Traian Rebedea
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Traian Rebedea
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringTraian Rebedea
 
Software Services in Romania – Academia and Industry
Software Services in Romania – Academia and IndustrySoftware Services in Romania – Academia and Industry
Software Services in Romania – Academia and IndustryTraian Rebedea
 
Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Traian Rebedea
 
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009Traian Rebedea
 
Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009Traian Rebedea
 
Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Traian Rebedea
 
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...Traian Rebedea
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaTraian Rebedea
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Traian Rebedea
 
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic ProgammingAlgorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic ProgammingTraian Rebedea
 
Conclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD SurveyConclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD SurveyTraian Rebedea
 
Propunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitarePropunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitareTraian Rebedea
 
Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Traian Rebedea
 
Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7Traian Rebedea
 

Viewers also liked (20)

Importanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuriImportanța algoritmilor pentru problemele de la interviuri
Importanța algoritmilor pentru problemele de la interviuri
 
Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9Algorithm Design and Complexity - Course 9
Algorithm Design and Complexity - Course 9
 
Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8Algorithm Design and Complexity - Course 8
Algorithm Design and Complexity - Course 8
 
Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2Algorithm Design and Complexity - Course 1&2
Algorithm Design and Complexity - Course 1&2
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question Answering
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Software Services in Romania – Academia and Industry
Software Services in Romania – Academia and IndustrySoftware Services in Romania – Academia and Industry
Software Services in Romania – Academia and Industry
 
Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009Istoria Web-ului - part 1 - tentativ How to Web 2009
Istoria Web-ului - part 1 - tentativ How to Web 2009
 
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
Istoria Web-ului - part 1 (2) - tentativ How to Web 2009
 
Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009Istoria Web-ului - part 2 - tentativ How to Web 2009
Istoria Web-ului - part 2 - tentativ How to Web 2009
 
Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6Algorithm Design and Complexity - Course 6
Algorithm Design and Complexity - Course 6
 
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
Automatic assessment of collaborative chat conversations with PolyCAFe - EC-T...
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10
 
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic ProgammingAlgorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
Algorithm Design and Complexity - Course 4 - Heaps and Dynamic Progamming
 
Conclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD SurveyConclusions and Recommendations of the Romanian ICT RTD Survey
Conclusions and Recommendations of the Romanian ICT RTD Survey
 
Propunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitarePropunere de dezvoltare a carierei universitare
Propunere de dezvoltare a carierei universitare
 
Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11Algorithm Design and Complexity - Course 11
Algorithm Design and Complexity - Course 11
 
Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7Algorithm Design and Complexity - Course 7
Algorithm Design and Complexity - Course 7
 

Similar to Detecting and Describing Historical Periods in a Large Corpora

EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube
 
Wikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic TrackingWikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic TrackingSeokhwan Kim
 
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyCapturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyLynn Connaway
 
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyCapturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyOCLC
 
Metric Fields in Information Science
Metric Fields in Information ScienceMetric Fields in Information Science
Metric Fields in Information ScienceGladys Wakat
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsGESIS
 
New directions in scholarly publishing: journal articles beyond the present
New directions in scholarly publishing: journal articles beyond the presentNew directions in scholarly publishing: journal articles beyond the present
New directions in scholarly publishing: journal articles beyond the presentRudjer Boskovic Institute
 
Lorna hughes 12 05-2013 NeDiMAH and ontology for DH
Lorna hughes 12 05-2013 NeDiMAH and ontology for DHLorna hughes 12 05-2013 NeDiMAH and ontology for DH
Lorna hughes 12 05-2013 NeDiMAH and ontology for DHlorna_hughes
 
Visualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham CorpusVisualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham CorpusUCLDH
 
Social Media Crawling & Mining Seminar
Social Media Crawling & Mining Seminar Social Media Crawling & Mining Seminar
Social Media Crawling & Mining Seminar Symeon Papadopoulos
 
Open Science, Open Data: towards a new transparent and reproducible ecosystem
Open Science, Open Data:   towards a new transparent and reproducible ecosystemOpen Science, Open Data:   towards a new transparent and reproducible ecosystem
Open Science, Open Data: towards a new transparent and reproducible ecosystemLIBER Europe
 
How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)Charleston Conference
 
Sapere project-introduction-dec-2010
Sapere project-introduction-dec-2010Sapere project-introduction-dec-2010
Sapere project-introduction-dec-2010awarenessproject
 
Collaborative development of born-digital archives to facilitate discovery | ...
Collaborative development of born-digital archives to facilitate discovery | ...Collaborative development of born-digital archives to facilitate discovery | ...
Collaborative development of born-digital archives to facilitate discovery | ...ResearchLibrariesUK
 
Does DH Scholarship Take Place in the Lab?
Does DH Scholarship Take Place in the Lab?Does DH Scholarship Take Place in the Lab?
Does DH Scholarship Take Place in the Lab?Shawn Day
 
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...ALISS
 

Similar to Detecting and Describing Historical Periods in a Large Corpora (20)

EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
 
The Virtual Research Environment and Libraries
The Virtual Research Environment and LibrariesThe Virtual Research Environment and Libraries
The Virtual Research Environment and Libraries
 
Wikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic TrackingWikipedia-based Kernels for Dialogue Topic Tracking
Wikipedia-based Kernels for Dialogue Topic Tracking
 
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyCapturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
 
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library EthnographyCapturing the Behaviors of the Elusive User: Strategies for Library Ethnography
Capturing the Behaviors of the Elusive User: Strategies for Library Ethnography
 
Metric Fields in Information Science
Metric Fields in Information ScienceMetric Fields in Information Science
Metric Fields in Information Science
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
 
New directions in scholarly publishing: journal articles beyond the present
New directions in scholarly publishing: journal articles beyond the presentNew directions in scholarly publishing: journal articles beyond the present
Detecting and Describing Historical Periods in a Large Corpora

  • 1. Detecting and Describing Historical Periods in a Large Corpora Tiberiu Popa, Traian Rebedea, Costin Chiru University Politehnica of Bucharest Faculty of Automatic Control and Computers
  • 2. Outline • Context • Architecture • Historical Features Detection • Topic Modeling for Historically Relevant Documents • Results • Future Work & Conclusions 11 Nov 2014 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014
  • 3. Context • Many historic events are remembered by slogans, expressions or words that are strongly linked to them • Try to establish the correlation between significant historic events and the texts written in that period • The analysis should be based on a representative set of texts written throughout history
  • 4. Context • The outcome should contain: – A partition of the years into groups such that all years in a group relate to a specific event – A short description for each group of years • Examples: – 1858–1864, 1867–1868: rebel, confederate, secession, vicksburg, chattanooga – 1969–1981: pollution, nixon, slavery, blacks, urbanization
  • 5. Google Books Ngrams • Corpus that contains statistics extracted from over 5 million books, or about 4% of all books ever published (in English) • Due to copyright restrictions, only frequency statistics are provided for each word • Frequencies ranging from unigrams to 5-grams • Books from 1500 to 2008 (nowadays)
  • 6. Google Books Ngrams • For each word, the associated time series is denoted by
  • 7. Related Work • Culturomics – quantitative analysis of culture – Computational investigation of cultural trends (e.g. using Google Books, or other corpora over a large period of time) – “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology” • Semantic evolution of words over time – Topics over time – Time influences the meaning of a word => change of topics/meanings over time • Evolution of the topics in a specific research field (e.g. computational linguistics) over time using topic models – Showed the rise of probabilistic models in NLP
  • 8. Architecture • Google Books N-grams → Historically Relevant Documents → Relevant Historical Topics
  • 9. Detection of Historical Features • A special case of bursty feature detection • Detects periods of increased activity in the time series • For each n-gram, it must also assign a “bursty” weight to each year (an integer between 0 and 10)
  • 10. Double Change • Peaks usually consist of a period of abrupt increase, followed by another period of abrupt decrease • Compute the relative change from one year to another • The bursty weight rt depends on
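
The double-change idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' method: the formula for rt did not survive in the transcript, so the scoring rule here (the smaller of the rise and the fall, scaled to 0–10), the `threshold` parameter, and the function names are all assumptions. It also simplifies a "period" of increase or decrease to a single year-to-year step.

```python
def relative_changes(freqs):
    """Relative change from each year to the next: (f[t+1] - f[t]) / f[t]."""
    changes = []
    for prev, curr in zip(freqs, freqs[1:]):
        changes.append((curr - prev) / prev if prev > 0 else 0.0)
    return changes


def double_change_weights(freqs, threshold=0.5, max_weight=10):
    """Assumed scoring rule: a year is bursty when an abrupt rise into it
    is followed by an abrupt fall out of it."""
    changes = relative_changes(freqs)
    weights = [0] * len(freqs)
    for t in range(1, len(freqs) - 1):
        rise = changes[t - 1]   # relative change from year t-1 to year t
        fall = -changes[t]      # relative drop from year t to year t+1
        if rise >= threshold and fall >= threshold:
            score = min(rise, fall)  # the fall can never exceed 1.0
            weights[t] = min(max_weight, round(score * max_weight))
    return weights


# A flat series with a single spike in the middle year.
print(double_change_weights([10, 10, 50, 10, 10]))  # [0, 0, 8, 0, 0]
```

Taking the minimum of the two changes means a year scores highly only when both the rise and the fall are abrupt, which matches the "double change" description; other combinations would be equally plausible.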
  • 11. Linear Model • Approximate the frequency time series by a piecewise linear function – Fit lines to the graph of the time series by considering larger and larger intervals until the error rises above a given threshold • The bursty weight rt depends on the logarithm of the slope
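
The greedy fitting step described on this slide might look like the sketch below. The error measure (maximum absolute residual), the `max_err` parameter, and the `slope_weight` mapping are assumptions made for illustration; the slide only states that intervals grow until the error exceeds a threshold and that the weight depends on the logarithm of the slope.

```python
import math


def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept, max absolute residual)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    err = max(abs(y - (slope * x + intercept)) for x, y in enumerate(ys))
    return slope, intercept, err


def piecewise_linear(ys, max_err):
    """Greedy segmentation: grow each segment while the fit error stays
    within max_err. Returns a list of (start, end, slope) tuples."""
    segments = []
    start = 0
    while start < len(ys) - 1:
        end = start + 1  # a segment covers at least two points
        slope, _, _ = fit_line(ys[start:end + 1])
        while end + 1 < len(ys):
            cand_slope, _, err = fit_line(ys[start:end + 2])
            if err > max_err:
                break
            slope, end = cand_slope, end + 1
        segments.append((start, end, slope))
        start = end  # adjacent segments share an endpoint
    return segments


def slope_weight(slope):
    """Assumed mapping: log(1 + slope) for rising segments, 0 otherwise."""
    return math.log1p(slope) if slope > 0 else 0.0


# A triangular peak splits into one rising and one falling segment.
for start, end, slope in piecewise_linear([0, 1, 2, 3, 2, 1, 0], max_err=0.1):
    print(start, end, round(slope, 3), round(slope_weight(slope), 3))
```

Using `log1p` rather than a plain logarithm is a convenience so that a slope of zero maps to a weight of zero; the original scaling to the 0–10 range is not reproduced here.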
  • 12. Gaussian Model • Peaks are usually bell-shaped, so try to fit a Gaussian distribution • First, normalize the time series to get a probability distribution • Then, try to approximate it with a normal distribution
  • 13. Gaussian Model • Last, use the earth mover’s distance (EMD) to compute the similarity between the normalized time series and the fitted normal distribution • Select non-overlapping intervals that have an EMD lower than 0.3 in a greedy fashion from left to right • The bursty weight rt depends on the change of the fitted Gaussian (max vs. min value) for each discrete interval
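
The three steps of the Gaussian model (normalize, fit, compare with EMD) can be sketched for a single candidate interval as below. Fitting by moment matching and computing the one-dimensional EMD as a sum of absolute cumulative differences are assumptions; the slides do not say how the fit or the distance was implemented, so the 0.3 threshold may not be on the same scale as the values this sketch produces.

```python
import math


def normalize(freqs):
    """Scale a frequency series so that it sums to 1."""
    total = sum(freqs)
    return [f / total for f in freqs]


def fitted_gaussian(p):
    """Discrete normal with the same mean and variance as p
    (moment matching), renormalized over the same years."""
    mean = sum(i * pi for i, pi in enumerate(p))
    var = sum((i - mean) ** 2 * pi for i, pi in enumerate(p))
    if var == 0:
        return list(p)  # all mass in one year: nothing to fit
    g = [math.exp(-((i - mean) ** 2) / (2 * var)) for i in range(len(p))]
    total = sum(g)
    return [gi / total for gi in g]


def emd_1d(p, q):
    """EMD between two histograms on the same unit-spaced bins:
    the sum of absolute differences of their cumulative sums."""
    dist, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        dist += abs(cum)
    return dist


bell = normalize([1, 4, 6, 4, 1])      # bell-shaped peak
two_peaks = normalize([5, 0, 0, 0, 5])  # clearly not bell-shaped
print(emd_1d(bell, fitted_gaussian(bell)) < 0.3)            # True
print(emd_1d(two_peaks, fitted_gaussian(two_peaks)) < 0.3)  # False
```

A full detector would evaluate this check over candidate intervals and keep non-overlapping ones greedily from left to right, as the slide describes.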
  • 14. Detection of Historical Features - Comparison • Difficult to measure which of these three methods performs best at detecting and characterizing historically relevant peaks • Need a dataset created with the help of historians
  • 15. Historically Relevant Documents • Each year is viewed as a document • The weight of a term in a specific year is given by rt – For all terms that have rt > 0 • Try to cluster these documents and summarize each cluster • Use LDA (Latent Dirichlet Allocation) to extract topics
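
The "each year is a document" construction can be sketched as below. The input layout (a mapping from term to per-year integer weights), the function names, and the sample weights are hypothetical; only the rule that a term enters a year's document when rt > 0 comes from the slide. The LDA step itself is not shown, since the slides do not name an implementation.

```python
def build_year_documents(bursty_weights):
    """bursty_weights maps term -> {year: weight}.
    Returns year -> {term: weight}, keeping only positive weights."""
    docs = {}
    for term, by_year in bursty_weights.items():
        for year, weight in by_year.items():
            if weight > 0:
                docs.setdefault(year, {})[term] = weight
    return docs


def as_token_list(doc):
    """Repeat each term as many times as its integer weight, giving a
    bag of words that a topic model can consume."""
    tokens = []
    for term in sorted(doc):
        tokens.extend([term] * doc[term])
    return tokens


# Illustrative weights only; these are not values from the paper.
weights = {
    "confederate": {1861: 3, 1900: 0},
    "secession": {1861: 2},
}
docs = build_year_documents(weights)
print(docs)                       # {1861: {'confederate': 3, 'secession': 2}}
print(as_token_list(docs[1861]))
```

Repeating a term rt times is one simple way to turn integer weights into term counts, which is the input format most LDA implementations expect.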
  • 16. Results • Topic modeling (e.g. LDA) allows each document to capture a mixture of topics • The analysis of the topics shows that most years have a predominant topic (over 50% in the corresponding mixture) • The table contains a post-processed version of the topics for the last century • Manually removed the noisy words that appeared in the top 10 words for each topic
  • 17. Results – American Civil War • Topic for the American Civil War (1858–1864, 1867–1868) • Double change bursty feature detection
  • 18. Results – WWI • Topic for World War I and the peace treaty (1916–1920) • Gaussian model bursty feature detection
  • 19. Results – pre-WWII • Topic for the period before World War II (1932–1936) • Linear model peak detection
  • 20. Future Work • Exploring alternatives – Computing the historical relevance of a word has a lot of potential for improvement, both in finding new definitions and in finding ways to combine the existing ones – Are topic models really the key to understanding historically relevant documents? • Improve the validation – Build a corpus, with the help of historians and linguists, that contains a set of “historically relevant” peaks and periods
  • 21. Conclusions • Theoretical framework for identifying historic periods and events • Linking these periods with words and LDA topics extracted from large corpora of texts • Important concept: historical relevance of a word • Several methods for computing the historically relevant features
  • 22. Questions? Discussion • This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397