This document discusses text analytics techniques for summarizing and analyzing unstructured text documents, with examples drawn from documents related to tobacco control. It covers data cleaning and standardization steps such as removing punctuation and stopwords, stemming, and deduplication. It also discusses frequency analysis using document-term matrices, topic modeling using LDA, and unsupervised and supervised classification techniques. The document provides examples that analyze posts from new users versus highly active users on an online forum, identifying topics and comparing topic distributions between the user groups.
7. • Translating text to consistent form
– Scrapy returns unicode strings
– Māori → Maori
• SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
• translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
• cleaned_content = html_content.translate(translation_table)
– Or…
• test = u'Māori' (you already have unicode)
• unidecode(test) (returns 'Maori')
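The translation-table approach on this slide is written for Python 2. A minimal Python 3 sketch of the same idea follows, using only the standard library. The function name `fold_diacritics` and the NFKD fallback are assumptions added for illustration, not part of the original slides; the fallback strips combining marks from characters that are not in the swap list.

```python
import unicodedata

# Explicit swaps, mirroring the slide's SWAPSET.
SWAPSET = [("Ā", "A"), ("ā", "a"), ("ä", "a")]
TRANSLATION_TABLE = {ord(src): dst for src, dst in SWAPSET}


def fold_diacritics(text):
    """Apply the explicit swaps, then drop any remaining combining marks."""
    swapped = text.translate(TRANSLATION_TABLE)
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", swapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned = fold_diacritics("Māori")
```

The explicit table keeps control over swaps that matter for the corpus, while the normalisation step catches accented characters nobody thought to list.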
8. • Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped HTML will be in latin1 (a mismatch with UTF-8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole
• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or… use re
• Don’t forget the metadata!
– Define a common data structure early if you have multiple sources
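The cleaning steps on this slide can be sketched in Python 3 as below. This is an illustration under stated assumptions, not the presenter's code: `decode_html`, `collapse_whitespace`, and the `Document` record with its field names are hypothetical, and falling back to latin-1 is only one possible answer to an encoding mismatch.

```python
import re
from dataclasses import dataclass


def decode_html(raw_bytes):
    """Try UTF-8 first; fall back to latin-1, which accepts any byte sequence."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")


def collapse_whitespace(text):
    """Replace runs of spaces, tabs, and newlines with a single space."""
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class Document:
    """Hypothetical common record shared by every source."""
    source: str        # e.g. the name of the spider or feed
    url: str
    collected_on: str  # ISO date string
    text: str


raw = "  Smokefree   policy\n caf\u00e9 ".encode("latin-1")
doc = Document(
    source="example-spider",
    url="http://example.org/page",
    collected_on="2012-07-31",
    text=collapse_whitespace(decode_html(raw)),
)
```

Because latin-1 never raises on decode, the fallback can silently produce wrong characters for other encodings, which is where the whack-a-mole comes in.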
11. Text Standardisation
• Using dictionaries for stem completion
library(tm)
# Build a term-document matrix and drop terms missing from 99% or more of documents
politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm <- removeSparseTerms(politi.tdm, 0.99)
politi.tdm <- as.matrix(politi.tdm)
# Get word counts in decreasing order and put the terms into a plain text document
word_freqs <- sort(rowSums(politi.tdm), decreasing = TRUE)
length(word_freqs)
smalldict <- PlainTextDocument(names(word_freqs))
# Complete each stem with the first matching dictionary term
politi.corpus_final <- tm_map(politi.corpus_stemmed, stemCompletion,
                              dictionary = smalldict, type = "first")
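To make the logic of the R snippet above easier to follow, here is a rough Python sketch of the same three ideas: a term-document count table, sparse-term removal, and stem completion against a frequency-ordered dictionary. It is a simplified illustration, not the tm implementation. The function names, the toy documents, and the prefix-matching rule for completion are all assumptions.

```python
def term_doc_counts(docs):
    """Return {term: [count in doc 0, count in doc 1, ...]}."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    return {term: [doc.split().count(term) for doc in docs] for term in vocab}


def remove_sparse_terms(tdm, sparse):
    """Keep terms whose share of zero-count documents is below `sparse`."""
    kept = {}
    for term, counts in tdm.items():
        empty_share = counts.count(0) / len(counts)
        if empty_share < sparse:
            kept[term] = counts
    return kept


def complete_stem(stem, dictionary):
    """Return the first dictionary word that starts with `stem`, else the stem."""
    for word in dictionary:
        if word.startswith(stem):
            return word
    return stem


# Toy documents, invented for this example.
docs = [
    "smoking policy announced",
    "policy on smoking and politics",
    "political smoking debate",
]
tdm = remove_sparse_terms(term_doc_counts(docs), sparse=0.99)
# Most frequent terms first, ties broken alphabetically.
dictionary = sorted(tdm, key=lambda term: (-sum(tdm[term]), term))
completed = [complete_stem(stem, dictionary) for stem in ["smok", "polic", "politi"]]
```

Ordering the dictionary by frequency means a "first match" completion tends to pick the most common surface form of each stem.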
17. Top 100 terms: Tariana Turia
Note: Documents from Aug 2011 – July 2012. Wordcloud package.
18. Top 100 terms: Tony Ryall
Note: Documents from Aug 2011 – July 2012
19. • Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol’ SQL
– Native or package functions for length of strings, sna, etc.
• Unsupervised
– nltk.cluster
– tm, topicmodels, as.matrix(dtm) with kmeans, etc.
• Supervised
– First hurdle: Training set
– nltk.classify
– tm, e1071, others…
Classification
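As an illustration of the supervised route, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing using only the Python standard library. It stands in for what nltk.classify or e1071 would do and is not taken from the original talk. The labels, the training posts, and the function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: the "first hurdle" in practice is building a real one.
training = [
    ("day one on the patch feeling nervous", "new"),
    ("just started my quit today", "new"),
    ("stay strong you are doing great", "active"),
    ("well done keep going good luck", "active"),
]
model = train_naive_bayes(training)
prediction = classify("good luck stay strong", model)
```

With four training posts the result means little; the point is only to show the shape of the train-then-classify workflow.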
20. 2 posts or fewer        more than 750 posts
    846      1,157          23       45,499
    41.0%    1.3%           1.1%     50.1%
22. • LDA (topicmodels)
– New users
Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
good     smoke    just     smoke    feel
day      time     day      quit     day
thank    week     get      can      dont
well     patch    realli   one      like
will     start    think    will     still
– Highly active users
Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
quit     good     day      like     feel
smoke    one      well     day      thing
can      take     great    your     just
will     stay     done     now      get
luck     strong   awesom   get      time
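For readers who want to see what sits behind topic tables like these, the following is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under assumptions, not how the topicmodels package fits its models: the function name, the hyperparameter defaults, and the toy documents are all invented here, and a real analysis would use an established library.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns (doc-topic shares, top terms per topic)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, v, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][v] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, word_id[w], z, +1)
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                update(d, v, assignments[d][i], -1)  # remove current token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, v, z, +1)

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_terms = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:5]]
        for k in range(n_topics)
    ]
    return theta, top_terms


# Toy stemmed posts, invented for this example.
posts = ["quit smoke patch week", "smoke quit patch start",
         "good luck stay strong", "stay strong good day"]
theta, top_terms = lda_gibbs([p.split() for p in posts], n_topics=2)
```

Here `theta` holds one topic distribution per document and `top_terms` the five highest-count terms per topic, which is the same kind of output the slide tabulates.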
23. • LDA (topicmodels)
– Highly active users (HAU)
                   Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
                   quit     good     day      like     feel
                   smoke    one      well     day      thing
                   can      take     great    your     just
                   will     stay     done     now      get
                   luck     strong   awesom   get      time
HAU1 (F, 38, PI)   18%      14%      40%      8%       20%
HAU2 (F, 33, NZE)  31%      21%      27%      6%       16%
HAU3 (M, 48, NZE)  16%      9%       21%      49%      5%
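One simple way to compare per-user topic distributions like these is to find each user's dominant topic and measure how far apart two users' distributions are. The sketch below uses the percentages from this slide and assumes the three rows correspond to HAU1, HAU2, and HAU3 in order. Total variation distance is a choice made for this example, not a method named in the talk.

```python
from itertools import combinations

TOPICS = ["Topic 1", "Topic 2", "Topic 3", "Topic 4", "Topic 5"]

# Percentages from the slide; assigning rows to users in order is an assumption.
USER_TOPIC_SHARES = {
    "HAU1": [18, 14, 40, 8, 20],
    "HAU2": [31, 21, 27, 6, 16],
    "HAU3": [16, 9, 21, 49, 5],
}


def normalise(shares):
    """Rescale so the shares sum to 1 (rounded percentages may not sum to 100)."""
    total = sum(shares)
    return [s / total for s in shares]


def dominant_topic(shares):
    """Name of the topic with the largest share."""
    return TOPICS[max(range(len(shares)), key=shares.__getitem__)]


def total_variation(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(normalise(p), normalise(q)))


dominant = {user: dominant_topic(s) for user, s in USER_TOPIC_SHARES.items()}
distances = {
    (a, b): total_variation(USER_TOPIC_SHARES[a], USER_TOPIC_SHARES[b])
    for a, b in combinations(sorted(USER_TOPIC_SHARES), 2)
}
```

The pairwise distances give a quick ranking of which highly active users post most alike in topic terms.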
24. Recap
• Your text will probably be messy
– Python, R-based tools reduce the pain
• Simple analyses can generate useful insight
• Combine with data of other types for context
– source, quantities, dates, network position, history
• May surface useful features for classification
Slides, Code: message2ben@gmail.com
Editor's Notes
Gather stage.
Clean stage
Standardise stage
Standardise stage. 0.99 is generous. Lower would remove more terms.
A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) elements. I.e., the resulting matrix contains only terms with a sparse factor of less than sparse.
TermDocumentMatrix: terms along side (rows), docs along top (columns).
Dedup and select stage
Analysis stage
Analysis stage. Dragonfly talk by Marcus Frean on Latent Dirichlet Allocation.