Detecting and Describing Historical Periods in a Large Corpora

Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people can usually tell whether a particular word or expression is related to a specific period in human history. The present paper aims to establish correlations between significant historic periods (or events) and the texts written in those periods. To achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus has been chosen as the basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can be used to provide insights about the most important periods and events in recent history, by automatically linking them with specific keywords or even LDA topics.

  1. Detecting and Describing Historical Periods in a Large Corpora • Tiberiu Popa, Traian Rebedea, Costin Chiru • University Politehnica of Bucharest, Faculty of Automatic Control and Computers
  2. Outline • Context • Architecture • Historical Features Detection • Topic Modeling for Historically Relevant Documents • Results • Future Work & Conclusions 11 Nov 2014 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014
  3. Context • Many historic events are remembered by slogans, expressions or words that are strongly linked to them • Try to establish the correlation between significant historic events and the texts written in that period • The analysis should be based on a representative set of texts written throughout history
  4. Context • The outcome should contain: – A partition of the years into groups such that all years in a group relate to a specific event – A short description for each group of years • Examples: – 1858–1864, 1867–1868: rebel, confederate, secession, vicksburg, chattanooga – 1969–1981: pollution, nixon, slavery, blacks, urbanization
  5. Google Books Ngrams • Corpus that contains statistics extracted from over 5 million books, or about 4% of all books ever published (in English) • Due to copyright restrictions, only frequency statistics are provided for each word • Frequencies ranging from unigrams to 5-grams • Books from 1500 to 2008 (nowadays)
  6. Google Books Ngrams • For each word, the associated time series is denoted by
  7. Related Work • Culturomics – quantitative analysis of culture – Computational investigation of cultural trends (e.g. using Google Books, or other corpora covering a long period of time) – “can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology” • Semantic evolution of words over time – Topics over time – Time influences the meaning of a word => change of topics/meanings over time • Evolution of the topics in a specific research field (e.g. computational linguistics) over time using topic models – Showed the rise of probabilistic models in NLP
  8. Architecture • Google Books N-grams → Historically Relevant Documents → Relevant Historical Topics
  9. Detection of Historical Features • A special case of bursty feature detection • Detects periods of increased activity in the time series • For each n-gram, it must also assign a “bursty” weight to each year (an integer from 0 to 10)
  10. Double Change • Peaks usually consist of a period of abrupt increase, followed by another period of abrupt decrease • Compute the relative change from one year to another • The bursty weight rt depends on
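
The double change idea can be illustrated with a minimal Python sketch. The slide does not preserve the formula for the weight, so everything beyond "relative change from one year to another" is an assumption here: the function names, the `threshold` and `scale` parameters, the mapping to an integer from 0 to 10, and the simplification that the rise and the fall each last a single year.

```python
def relative_changes(series):
    """Relative change from each year to the next: (x[t+1] - x[t]) / x[t]."""
    changes = []
    for prev, curr in zip(series, series[1:]):
        # Guard against division by zero for n-grams absent in a year.
        changes.append((curr - prev) / prev if prev > 0 else 0.0)
    return changes


def double_change_weights(series, threshold=0.5, scale=5.0):
    """Assign an integer weight in [0, 10] to each year (hypothetical scoring).

    Year t gets a positive weight when the series rose sharply into t and
    fell sharply right after it. `threshold` and `scale` are illustrative
    values, not taken from the slides.
    """
    changes = relative_changes(series)
    weights = [0] * len(series)
    for t in range(1, len(series) - 1):
        rise = changes[t - 1]   # change from year t-1 to year t
        fall = -changes[t]      # drop from year t to t+1, as a positive number
        if rise >= threshold and fall >= threshold:
            weights[t] = min(10, round(scale * min(rise, fall)))
    return weights


# Toy frequencies with a single spike in the fourth year.
print(double_change_weights([10, 10, 11, 30, 12, 11, 10]))
```

Taking the smaller of the rise and the fall means a year only scores highly when both sides of the peak are steep, which matches the "increase followed by decrease" description.
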
  11. Linear Model • Approximate the frequency time series by a piecewise linear function – Fit lines to the graph of the time series by considering larger and larger intervals until the error rises above a given threshold • The bursty weight rt depends on the logarithm of the slope
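
A rough Python sketch of such a greedy piecewise linear fit is given below. The slide names neither the error measure nor the exact weight formula, so the maximum absolute residual used as the error, the `max_error` parameter, and the `slope_weight` mapping (a clipped `log1p` of positive slopes) are all assumptions made for illustration.

```python
import math


def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept, max_abs_error).
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    err = max(abs(y - (slope * x + intercept)) for x, y in enumerate(ys))
    return slope, intercept, err


def piecewise_linear(series, max_error):
    """Grow each segment until the fit error would exceed max_error.

    Returns a list of (start, end, slope) with inclusive indices.
    """
    segments = []
    n = len(series)
    start = 0
    while start < n - 1:
        end = start + 1  # a two-point segment always fits exactly
        slope = fit_line(series[start:end + 1])[0]
        while end + 1 < n:
            cand_slope, _, err = fit_line(series[start:end + 2])
            if err > max_error:
                break
            end += 1
            slope = cand_slope
        segments.append((start, end, slope))
        start = end  # adjacent segments share their boundary point
    return segments


def slope_weight(slope):
    """Hypothetical weight: log-scaled positive slope, clipped to 0..10."""
    if slope <= 0:
        return 0
    return min(10, round(math.log1p(slope)))


for start, end, slope in piecewise_linear([0, 0, 0, 10, 20, 30, 30, 30], 0.5):
    print(start, end, slope, slope_weight(slope))
```
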
  12. Gaussian Model • Peaks are usually bell-shaped, so try to fit a Gaussian distribution • First, normalize the time series to get a probability distribution • Then, try to approximate it with a normal distribution
  13. Gaussian Model • Last, use the earth mover’s distance (EMD) to compute the similarity between the normalized time series and the fitted normal distribution • Select non-overlapping intervals that have an EMD lower than 0.3 in a greedy fashion from left to right • The bursty weight rt depends on the change of the fitted Gaussian (max vs. min value) for each discrete interval
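
The Gaussian model steps can be sketched in Python as follows. This is only one possible reading of the slides: the moment-matching fit, the window bounds `min_len` and `max_len`, and the rule of keeping the longest qualifying window at each start position are assumptions. The EMD is computed as the sum of absolute differences between the two cumulative distributions, which holds for distributions on the same unit-spaced grid; whether the 0.3 threshold from the slide refers to this exact scaling is not known.

```python
import math


def normalize(values):
    """Scale non-negative values to sum to 1; None if they sum to zero."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else None


def fit_gaussian(p):
    """Moment-matched Gaussian, discretized on indices 0..len(p)-1."""
    mean = sum(i * pi for i, pi in enumerate(p))
    var = sum((i - mean) ** 2 * pi for i, pi in enumerate(p))
    if var <= 0:
        return None  # all mass on one year: no bell shape to fit
    q = [math.exp(-((i - mean) ** 2) / (2 * var)) for i in range(len(p))]
    return normalize(q)


def emd_1d(p, q):
    """EMD on a shared unit-spaced grid: sum of |CDF_p - CDF_q|."""
    dist = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        dist += abs(cum_p - cum_q)
    return dist


def gaussian_intervals(series, min_len=3, max_len=15, max_emd=0.3):
    """Greedy left-to-right selection of non-overlapping bell-shaped windows."""
    intervals = []
    n = len(series)
    start = 0
    while start + min_len <= n:
        best = None
        for length in range(min_len, min(max_len, n - start) + 1):
            p = normalize(series[start:start + length])
            q = fit_gaussian(p) if p else None
            if q is not None and emd_1d(p, q) < max_emd:
                best = (start, start + length - 1)  # keep the longest match
        if best:
            intervals.append(best)
            start = best[1] + 1  # continue after the selected interval
        else:
            start += 1
    return intervals


print(gaussian_intervals([1, 4, 9, 4, 1], max_len=5))
```

Note that a nearly flat window can also pass the EMD test, since a wide Gaussian approximates it well. This is presumably why the weight on the slide depends on the max versus min value of the fitted Gaussian: a flat fit has little change and would receive a low weight.
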
  14. Detection of Historical Features - Comparison • Difficult to measure which of these three methods performs best at detecting and characterizing historically relevant peaks • Need a dataset created with the help of historians
  15. Historically Relevant Documents • Each year is viewed as a document • The weight of a term in a specific year is given by rt – For all terms that have rt > 0 • Try to cluster these documents and summarize each cluster • Use LDA (Latent Dirichlet Allocation) to extract topics
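
To make the "each year is a document" construction concrete, here is a small Python sketch. The input layout (a mapping from term to per-year integer weights), the function names, and the toy weights are assumptions for illustration; only the terms themselves come from the examples in the slides. The LDA step is left out, and the token lists produced here are simply one common input form for a topic model.

```python
def build_year_documents(term_weights):
    """{term: {year: weight}} -> {year: {term: weight}}, keeping weight > 0."""
    docs = {}
    for term, by_year in term_weights.items():
        for year, weight in by_year.items():
            if weight > 0:
                docs.setdefault(year, {})[term] = weight
    return docs


def to_token_list(doc):
    """Repeat each term `weight` times to get a bag of words for one year."""
    tokens = []
    for term in sorted(doc):
        tokens.extend([term] * doc[term])
    return tokens


# Toy weights (made up); a weight of 0 means the term is not bursty that year.
weights = {
    "rebel": {1861: 7, 1900: 0},
    "confederate": {1861: 9},
    "nixon": {1972: 8},
}
docs = build_year_documents(weights)
print(sorted(docs))
print(to_token_list(docs[1861]))
```
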
  16. Results • Topic modeling (e.g. LDA) allows each document to capture a mixture of topics • The analysis of the topics shows that most years have a predominant topic (over 50% in the corresponding mixture) • The table contains a post-processed version of the topics for the last century • Manually removed the noisy words that appeared in the top 10 words for each topic
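
One way to turn per-year topic mixtures into groups of years is sketched below in Python. It assumes the mixtures are already available as lists of probabilities per year; the function names and the toy mixtures are hypothetical, and the 50% cut-off follows the "over 50%" remark on the slide.

```python
def predominant_topic(mixture, min_share=0.5):
    """Index of the topic whose share exceeds min_share, or None."""
    topic = max(range(len(mixture)), key=lambda k: mixture[k])
    return topic if mixture[topic] > min_share else None


def group_years(year_mixtures, min_share=0.5):
    """{year: mixture} -> {topic: [(first_year, last_year), ...]}.

    Consecutive years sharing a predominant topic are merged into one range.
    """
    groups = {}
    for year in sorted(year_mixtures):
        topic = predominant_topic(year_mixtures[year], min_share)
        if topic is None:
            continue  # no topic dominates this year
        ranges = groups.setdefault(topic, [])
        if ranges and ranges[-1][1] == year - 1:
            ranges[-1] = (ranges[-1][0], year)  # extend the current range
        else:
            ranges.append((year, year))
    return groups


# Toy mixtures over two topics (values are made up).
mixtures = {
    1858: [0.8, 0.2],
    1859: [0.7, 0.3],
    1860: [0.4, 0.6],
    1861: [0.5, 0.5],
    1862: [0.9, 0.1],
}
print(group_years(mixtures))
```

A topic can end up with several disjoint ranges, which is the shape of the example groups given earlier in the deck (e.g. 1858–1864 together with 1867–1868).
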
  17. Results – American Civil War • Topic for the American Civil War (1858-1864, 1867-1868) • Double change bursty feature detection
  18. Results – WWI • Topic for World War I and the peace treaty (1916-1920) • Gaussian model bursty feature detection
  19. Results – pre-WWII • Topic for the period before World War II (1932-1936) • Linear model peak detection
  20. Future Work • Exploring alternatives – Computing the historical relevance of a word has a lot of potential for improvement, both in finding new definitions and in finding ways to combine the existing ones – Are topic models really the key to understanding historically relevant documents? • Improve the validation – Build a corpus, with the help of historians and linguists, that contains a set of “historically relevant” peaks and periods
  21. Conclusions • Theoretical framework for identifying historic periods and events • Linking these periods with words and LDA topics extracted from large corpora of texts • Important concept: historical relevance of a word • Several methods for computing the historically relevant features
  22. Questions? Discussion • This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397
