Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people can also tell whether a particular word or expression is related to a specific period in human history. This paper aims to establish correlations between significant historic periods (or events) and the texts written in those periods. To achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus was chosen as the basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can provide insights into the most important periods and events in recent history by automatically linking them with specific keywords or even LDA topics.
Detecting and Describing Historical Periods in a Large Corpora
1. Detecting and Describing Historical
Periods in a Large Corpora
Tiberiu Popa, Traian Rebedea, Costin Chiru
University Politehnica of Bucharest
Faculty of Automatic Control and Computers
2. Outline
• Context
• Architecture
• Historical Features Detection
• Topic Modeling for Historically Relevant
Documents
• Results
• Future Work & Conclusions
11 Nov 2014
26th IEEE International Conference on Tools
with Artificial Intelligence, ICTAI 2014
3. Context
• Many historic events are remembered by slogans,
expressions or words that are strongly linked to them
• Try to establish the correlation between significant
historic events and the texts written in that period
• The analysis should be based on a representative set
of texts written throughout history
4. Context
• The outcome should contain:
– A separation of the years into groups such that all years in a
group are related to a specific event
– A short description for each group of years
• Examples:
– 1858–1864, 1867–1868: rebel, confederate, secession,
vicksburg, chattanooga
– 1969–1981: pollution, nixon, slavery, blacks,
urbanization
5. Google Books Ngrams
• Corpus that contains statistics extracted from
over 5 million books, or about 4% of all books
ever published (in English)
• Due to copyright restrictions, only frequency
statistics are provided for each word
• Frequencies ranging from unigrams to 5-grams
• Books from 1500 to 2008 (nowadays)
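As a rough illustration of how per-year counts become a time series, the sketch below parses tab-separated count lines into relative frequencies. The line layout (n-gram, year, match count, volume count) and the `totals` dictionary of yearly token counts are assumptions made for this example, and the sample values are made up rather than taken from the corpus.

```python
from collections import defaultdict


def build_time_series(lines, totals):
    """Return {ngram: {year: relative_frequency}} from tab-separated count lines.

    Assumed layout per line: ngram <TAB> year <TAB> match_count <TAB> volume_count.
    `totals` maps a year to the total number of tokens counted for that year.
    """
    series = defaultdict(dict)
    for line in lines:
        ngram, year, match_count = line.rstrip("\n").split("\t")[:3]
        year = int(year)
        # Skip years for which no total is known, to avoid dividing by zero.
        if totals.get(year, 0) > 0:
            series[ngram][year] = int(match_count) / totals[year]
    return dict(series)


# Toy input: invented counts, for illustration only.
sample = [
    "secession\t1859\t120\t40",
    "secession\t1860\t300\t75",
    "secession\t1861\t900\t150",
]
totals = {1859: 1_000_000, 1860: 1_000_000, 1861: 1_500_000}
print(build_time_series(sample, totals)["secession"])
```

Dividing by the yearly total keeps the growth of the corpus itself from looking like a burst in every word.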
6. Google Books Ngrams
• For each word, the associated time series is
denoted by
7. Related Work
• Culturomics – quantitative analysis of culture
– Computational investigation of cultural trends (e.g. using
Google Books, or other corpora over a large period of time)
– “can provide insights about fields as diverse as lexicography, the
evolution of grammar, collective memory, the adoption of
technology, the pursuit of fame, censorship, and historical
epidemiology”
• Semantic evolution of words over time
– Topics over time
– Time influences the meaning of a word => change of
topics/meanings over time
• Evolution of the topics in a specific research field (e.g.
computational linguistics) over time using topic models
– Showed the rise of probabilistic models in NLP
8. Architecture
Google Books N-grams → Historically Relevant Documents → Relevant Historical Topics
9. Detection of Historical Features
• A special case of bursty feature detection
• Detects periods of increased activity in the time series
• For each n-gram, it must also assign a “bursty” weight to each
year (an integer between 0 and 10)
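One simple way to obtain integer weights in the 0 to 10 range is to scale raw per-year burst scores by the largest score for that n-gram. The slides do not say how the scaling is done, so the linear scaling below and the name `to_bursty_weights` are assumptions made for this sketch.

```python
def to_bursty_weights(scores, max_weight=10):
    """Map raw per-year burst scores to integers in [0, max_weight].

    Assumption: scores are scaled linearly by the largest score of the
    n-gram; negative scores are treated as "no burst" and mapped to 0.
    """
    peak = max(scores.values(), default=0.0)
    if peak <= 0:
        return {year: 0 for year in scores}
    return {
        year: round(max_weight * max(score, 0.0) / peak)
        for year, score in scores.items()
    }


# Toy scores, for illustration only.
print(to_bursty_weights({1860: 0.0, 1861: 2.0, 1862: 4.0, 1863: -1.0}))
```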
10. Double Change
• Peaks usually consist of a period of abrupt increase, followed
by another period of abrupt decrease
• Compute the relative change from one year to another
• The bursty weight rt depends on
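A minimal sketch of the two ingredients named on this slide: the relative change from one year to the next, and the detection of a sharp rise followed by a sharp fall. It simplifies each "period" to a single year, and the two thresholds are hypothetical values chosen for the example. It does not reproduce the slide's formula for rt.

```python
def relative_changes(freqs):
    """Relative change from each year to the next: (f[t+1] - f[t]) / f[t]."""
    changes = []
    for prev, curr in zip(freqs, freqs[1:]):
        # A zero frequency gives no meaningful relative change; report 0.
        changes.append((curr - prev) / prev if prev > 0 else 0.0)
    return changes


def double_change_peaks(freqs, rise_threshold=0.5, fall_threshold=0.3):
    """Indices t where a sharp rise into year t is followed by a sharp fall.

    Both thresholds are illustrative assumptions, not values from the slides.
    """
    changes = relative_changes(freqs)
    peaks = []
    for t in range(1, len(freqs) - 1):
        rise = changes[t - 1]  # change from year t-1 to year t
        fall = changes[t]      # change from year t to year t+1
        if rise >= rise_threshold and fall <= -fall_threshold:
            peaks.append(t)
    return peaks


# Toy series with one clear spike at index 2.
print(double_change_peaks([1.0, 1.1, 2.0, 1.2, 1.1]))
```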
11. Linear Model
• Approximate the frequency time series by a piecewise linear
function
– Fit lines to the graph of the time series by considering larger and larger
intervals until the error rises above a given threshold
• The bursty weight rt depends on the logarithm of the slope
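The greedy fitting step can be sketched as follows: start a segment, keep extending it by one year while a least-squares line still fits, and close it once the error exceeds the threshold. Measuring the error as the maximum absolute residual is an assumption (the slide only says "error"), and the mapping from slope to rt is not shown.

```python
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, max_abs_error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    error = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return slope, intercept, error


def piecewise_linear(ys, max_error):
    """Greedy segmentation into (start, end, slope) triples over indices of ys.

    Each segment grows until the fit error would exceed max_error.
    Adjacent segments share their boundary point.
    """
    segments = []
    start, n = 0, len(ys)
    while start < n - 1:
        end = start + 1  # a two-point segment always fits exactly
        slope, _, _ = fit_line([start, end], ys[start:end + 1])
        while end + 1 < n:
            xs = list(range(start, end + 2))
            candidate_slope, _, error = fit_line(xs, ys[start:end + 2])
            if error > max_error:
                break
            end += 1
            slope = candidate_slope
        segments.append((start, end, slope))
        start = end
    return segments


# Toy series: a linear rise followed by a linear fall.
print(piecewise_linear([0, 1, 2, 3, 2, 1, 0], max_error=0.1))
```

On the toy series, the rise and the fall each become one segment, so a steep positive slope marks the onset of the peak.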
12. Gaussian Model
• Peaks are usually bell-shaped, so try to fit a Gaussian
distribution
• First, normalize the time series to get a probability
distribution
• Then, try to approximate it with a normal distribution
13. Gaussian Model
• Last, use the earth mover’s distance (EMD) to compute the
similarity between the normalized time series and the fitted
normal distribution
• Select non-overlapping intervals that have an EMD lower than
0.3 in a greedy fashion from left to right
• The bursty weight rt depends on the change of the fitted
Gaussian (max vs. min value) for each discrete interval
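The three steps of the Gaussian model (normalize, fit, compare) can be sketched on a single window of the time series. Fitting by moment matching and evaluating the Gaussian on the same yearly grid are assumptions of this example. The EMD here is the standard one-dimensional form, the sum of absolute differences between the two cumulative distributions, so its scale may differ from the one behind the 0.3 threshold on the slide. The greedy selection of intervals is not shown.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def fit_gaussian(probs):
    """Moment matching: mean and standard deviation of positions 0..n-1."""
    mean = sum(i * p for i, p in enumerate(probs))
    variance = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    return mean, math.sqrt(variance)


def discretized_gaussian(n, mean, std):
    """Gaussian density evaluated on positions 0..n-1, renormalized to sum to 1."""
    if std <= 0:
        # Degenerate case: all mass on the position closest to the mean.
        return [1.0 if i == round(mean) else 0.0 for i in range(n)]
    density = [math.exp(-0.5 * ((i - mean) / std) ** 2) for i in range(n)]
    return normalize(density)


def emd_1d(p, q):
    """1-D earth mover's distance on a shared unit-spaced grid."""
    distance = cdf_p = cdf_q = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        distance += abs(cdf_p - cdf_q)
    return distance


def gaussian_fit_distance(window):
    """EMD between a normalized window and its moment-matched Gaussian."""
    probs = normalize(window)
    mean, std = fit_gaussian(probs)
    return emd_1d(probs, discretized_gaussian(len(probs), mean, std))


# Toy windows: a bell-shaped peak versus a two-sided (non-peak) shape.
print(gaussian_fit_distance([1, 4, 9, 4, 1]))
print(gaussian_fit_distance([9, 1, 0, 1, 9]))
```

The bell-shaped window yields a much smaller distance than the two-sided one, which is the property the interval selection relies on.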
14. Detection of Historical Features -
Comparison
• Difficult to measure which of these three methods performs
best at detecting and characterizing historically relevant peaks
• Need a dataset created with the help of historians
15. Historically Relevant Documents
• Each year is viewed as a document
• The weight of a term in a specific year is given by rt
– For all terms that have rt > 0
• Try to cluster these documents and summarize each cluster
• Use LDA (Latent Dirichlet Allocation) to extract topics
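The "year as document" view can be sketched as a year-by-term matrix whose entries are the bursty weights. The input layout `{term: {year: rt}}` and the function names are assumptions made for this example. The resulting matrix has the shape of a document-term count matrix, which is the input most LDA implementations expect; the LDA step itself is not shown.

```python
def build_year_documents(weights):
    """Turn {term: {year: rt}} into {year: {term: rt}}, keeping only rt > 0."""
    documents = {}
    for term, by_year in weights.items():
        for year, r in by_year.items():
            if r > 0:
                documents.setdefault(year, {})[term] = r
    return documents


def to_count_matrix(documents):
    """Return (years, vocabulary, matrix) with one row per year, one column per term."""
    years = sorted(documents)
    vocabulary = sorted({term for doc in documents.values() for term in doc})
    matrix = [[documents[y].get(term, 0) for term in vocabulary] for y in years]
    return years, vocabulary, matrix


# Toy weights, for illustration only.
weights = {
    "rebel": {1861: 7, 1862: 9, 1900: 0},
    "pollution": {1970: 8},
}
years, vocabulary, matrix = to_count_matrix(build_year_documents(weights))
print(years, vocabulary, matrix)
```

Years in which no term has a positive weight produce no document, so they drop out of the matrix.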
16. Results
• Topic modeling (e.g. LDA)
allows each document to
capture a mixture of topics
• The analysis of the topics
shows that most years have a
predominant topic (over 50%
in the corresponding mixture)
• The table contains a
post-processed version of the
topics for the last century
• Manually removed the noisy
words that appeared in the top
10 words for each topic
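Given a topic mixture for each year, the predominant topic and the resulting groups of years can be derived as in the sketch below. The mixtures are toy values, and the grouping of consecutive years into ranges is an assumption about how period labels such as "1858–1864, 1867–1868" could be produced.

```python
def dominant_topics(mixtures, min_share=0.5):
    """Map each year to its predominant topic index, if its share exceeds min_share."""
    result = {}
    for year, shares in mixtures.items():
        best = max(range(len(shares)), key=shares.__getitem__)
        if shares[best] > min_share:
            result[year] = best
    return result


def year_ranges(years):
    """Collapse years into inclusive (start, end) runs of consecutive years."""
    ranges = []
    for year in sorted(years):
        if ranges and year == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], year)
        else:
            ranges.append((year, year))
    return ranges


# Toy mixtures over two topics, for illustration only.
mixtures = {
    1858: [0.8, 0.2],
    1859: [0.7, 0.3],
    1860: [0.4, 0.6],
    1861: [0.5, 0.5],
    1862: [0.9, 0.1],
}
dominant = dominant_topics(mixtures)
topic_0_years = [year for year, topic in dominant.items() if topic == 0]
print(year_ranges(topic_0_years))
```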
17. Results – American Civil War
• Topic for the American Civil War (1858-1864,
1867-1868)
• Double change bursty feature detection
18. Results – WWI
• Topic for World War I and the peace treaty
(1916-1920)
• Gaussian model bursty feature detection
19. Results – pre-WWII
• Topic for the period before World War II
(1932-1936)
• Linear model peak detection
20. Future Work
• Exploring alternatives
– Computing the historical relevance of a word has a lot
of potential for improvement, both in finding new
definitions and in finding ways to combine the existing
ones
– Are topic models really the key to understanding
historically relevant documents?
• Improve the validation
– Build a corpus, with the help of historians and
linguists, that contains a set of “historically relevant”
peaks and periods
21. Conclusions
• Theoretical framework for identifying historic periods
and events
• Linking these periods with words and LDA topics
extracted from large corpora of texts
• Important concept: historical relevance of a word
• Several methods for computing the historically
relevant features
22. Questions?
Discussion
This work has been funded by the
Sectorial Operational Programme
Human Resources Development
2007-2013 of the Romanian Ministry of
European Funds through the Financial
Agreement POSDRU/159/1.5/S/132397