Text-mining as a Research Tool in the Humanities and Social Sciences

Duke Libraries / Text > Data September 20, 2012

Ryan Shaw
ryanshaw@unc.edu
http://aesh.in/RC

@rybesh #duketext 1



Roberto Busa


Automated text analysis

Automated text analysis is a tool for discovery
and measurement in textual data of prevalent
attitudes, concepts, or events.

O'Connor, Bamman & Smith 2011
"Computational Text Analysis for Social Science"
http://goo.gl/PxruI


Automated text analysis is a tool for discovery
and measurement in textual data of patterns
of language use interpretable as
prevalent attitudes, concepts, or events.

O'Connor, Bamman & Smith 2011
"Computational Text Analysis for Social Science"
http://goo.gl/PxruI


Language modeling

• Methods for automated text analysis are
based on mathematical models of language
• Mathematical models distinguish elements
and make explicit the relations among them
• They do not explain, but they can be
interpreted

Black 1962, "Models and Archetypes"
http://goo.gl/zKtrx


Language modeling

• All mathematical models of language are
necessarily wrong
• Nevertheless they may be useful
• They must be evaluated on their ability to
help scholars make inferences, achieve
insights, and generate new interpretations

Grimmer & Stewart 2012, "Text as Data"
http://goo.gl/tFPFs


Plan of attack

• Acquiring text
• Representing text
• Analyzing text
• Validating results
• Managing data


Acquiring text
Collecting your data



Sources

• Existing digital corpora
• Other digital sources (e.g. Web, Twitter)
• Undigitized text



Existing digital corpora

• Ideally, texts will be available as XML
• Quality of text and metadata is high
• But collections tend to be small
• Licensing agreements may prohibit
text analysis



• 10.5 million total volumes

• 5.5 million book titles

• 270,000 serial titles

• 3.2 million public domain

http://www.hathitrust.org/htrc


Other digital sources

• Some kinds of texts (e.g. tweets) can be
obtained through an API
• Websites without APIs can be "scraped"
• Generally requires custom programming
• Website restrictions may limit how much
or how quickly texts can be collected
• Metadata will be limited or absent
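
As a rough illustration of the "custom programming" that scraping tends to involve, the sketch below fetches a list of pages while respecting robots.txt and pausing between requests. It is a minimal sketch, not a recommended crawler: the user-agent string, the two-second delay, and the function names are all assumptions made for the example, and a site's terms of use may impose limits that robots.txt does not express.

```python
import time
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "research-text-collector"  # hypothetical name for this example


def fetch(url):
    """Download one page and return its text (assumes UTF-8 content)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def allowed(url):
    """Ask the site's robots.txt whether this URL may be fetched."""
    parts = urlsplit(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)


def collect(urls, fetch=fetch, allowed=allowed, delay_seconds=2.0):
    """Fetch each permitted URL, pausing between requests to limit load."""
    pages = {}
    for url in urls:
        if not allowed(url):
            continue  # skip pages the site asks robots not to fetch
        pages[url] = fetch(url)
        time.sleep(delay_seconds)
    return pages
```

Passing `fetch` and `allowed` as parameters lets the loop be tried out on canned data before it is pointed at a real site.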


Undigitized text

• Undigitized text must be scanned and
subjected to Optical Character Recognition
• Time and labor intensive
• OCR will introduce errors in your texts
• You need to produce your own metadata



Preparing texts

• OCR errors
• Words broken across lines
• Running headers and footers
• Breaking into paragraphs, sentences, etc.
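
Three of the clean-up tasks listed above can be sketched in a few lines of Python. These are deliberately naive helpers, written for illustration under stated assumptions: pages arrive as separate strings, running headers repeat verbatim from page to page, and sentences end with ".", "!" or "?". Real texts break all three assumptions (page numbers change, compounds are legitimately hyphenated, abbreviations such as "Mr." end in a period), so the output still needs checking by hand.

```python
import re
from collections import Counter


def join_broken_words(text):
    """Rejoin words hyphenated across a line break.

    Caveat: a genuine compound that happens to break at the hyphen
    will also lose its hyphen.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def drop_running_lines(pages, min_share=0.5):
    """Drop lines that recur on at least min_share of the pages."""
    seen = Counter()
    for page in pages:
        seen.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in seen.items() if n / len(pages) >= min_share}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive)."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]
```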



Preparing texts

• The bulk of your time will be spent
acquiring and preparing your texts
• It is worth your time to learn a scripting
language (such as Python)
• Command-line text-processing tools
on Mac OS and Unix are also very useful



Representing text
Turning words into numbers



Slowly welling from the point of her gold nib,
pale blue ink dissolved the full stop; for there
her pen stuck; her eyes fixed, and tears slowly
filled them. The entire bay quivered; the
lighthouse wobbled; and she had the illusion
that the mast of Mr. Connor's little yacht was
bending like a wax candle in the sun. She
winked quickly. Accidents were awful things.
She winked again. The mast was straight; the
waves were regular; the lighthouse was upright;
but the blot had spread.



11 the 1 wax 1 quivered
3 was 1 waves 1 quickly
3 she 1 upright 1 point
3 her 1 things 1 pen
2 winked 1 there 1 pale
2 were 1 them 1 nib
2 slowly 1 that 1 mr
2 of 1 tears 1 little
2 mast 1 sun 1 like
2 lighthouse 1 stuck 1 ink
2 had 1 straight 1 in
2 and 1 stop 1 illusion
1 yacht 1 spread 1 gold
1 wobbled 1 s 1 full
1 welling 1 regular 1 from
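
A table of counts like the one above can be produced with a few lines of Python. This is a minimal sketch that assumes a very simple tokenizer (lowercase, then keep runs of the letters a to z); the function names are made up for the example. A tokenizer of this kind splits "Connor's" into "connor" and "s", which would account for the "s" row in the table, though the slides do not say how the original counts were made.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(tokenize(text))


sample = "She winked quickly. Accidents were awful things. She winked again."
counts = word_counts(sample)

# Print the most frequent words first, in "count word" form.
for word, n in counts.most_common(3):
    print(n, word)
```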


11 the 1 wax 1 quiver
3 wa 1 wave 1 quickli
3 her 1 thing 1 pen
2 wink 1 there 1 pale
2 were 1 them 1 nib
2 slowli 1 that 1 mr
2 of 1 tear 1 littl
2 mast 1 sun 1 like
2 lighthous 1 stuck 1 ink
2 and 1 stop 1 illus
1 wobbl 1 s 1 full
1 well 1 regular 1 from


doc 1 doc 2 doc 3 doc 4 doc 5 doc 6
accid 1
actual 1
again 1 1
alreadi 1
antenna 1
archer 1
avoid 2 1
awai 1
aw 1
bag 1
bandanna 1
barfoot 2


Document similarity

[Figure: documents plotted as points in a space whose axes are word counts, with "avoid" on the horizontal axis and "again" on the vertical axis, each running from 1 to 2. Doc 1 and doc 6 are plotted, and the similarity between them is marked.]
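
Once documents are rows of word counts, similarity becomes a calculation on vectors. The slides do not name the measure used; cosine similarity, sketched below, is one common choice and is assumed here for illustration. The two count vectors are hypothetical values on the "avoid" and "again" axes, not the actual counts for doc 1 and doc 6.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical counts for two documents.
doc_a = Counter({"avoid": 2, "again": 1})
doc_b = Counter({"avoid": 1, "again": 1})

print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because it compares directions and ignores vector length, this measure treats a long document and a short one with the same word proportions as identical.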


Analyzing text
Counting, comparing, categorizing and pattern-finding



Six methods of text analysis

• Reading
• Counting words
• Human coding (manual content analysis)
• Dictionary methods
• Supervised machine learning
• Unsupervised machine learning

Quinn et al. 2010
http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x


Counting words

http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html


Counting words

Michel et al. 2010
http://dx.doi.org/10.1126/science.1199644


Counting words

• Easily computed
• Results are replicable
• Comparisons require metadata
e.g. publication year, language,
subject category, location
• Word use is ambiguous
• Spelling may vary


Concordance tools



Dictionary methods

• A dictionary is simply a list of words
• Lists are compiled for specific categories of
interest: negative words, law-related words,
names of places, names of chemicals, etc.
• May be custom-built or reused



Lexicoder Sentiment Dictionary
A LIE 0 WOUNDED 0 ABILITY* 1 WOOS 1
ABANDON* 0 WOUNDS 0 ABOUND* 1 WORKABLE* 1
ABAS* 0 WRATH* 0 ABSOLV* 1 WORKMANSHIP* 1
ABATTOIR* 0 WRECK* 0 ABSORBENT* 1 WORSHIP* 1
ABDICAT* 0 WRESTL* 0 ABSORPTION* 1 WORTH 1
ABERRA* 0 WRETCH* 0 ABUNDANC* 1 WORTH WHILE* 1
ABHOR* 0 WRITHE* 0 ABUNDANT* 1 WORTHI* 1
ABJECT* 0 WRONG* 0 ACCED* 1 WORTHWHILE* 1
ABNORMAL* 0 XENOPHOB* 0 ACCENTUAT* 1 WORTHY* 1
ABOLISH* 0 YAWN* 0 ACCEPT* 1 YOUNG AT HEART 1
ABOMINAB* 0 YEARN* 0 ACCESSIB* 1 ZEAL 1
ABOMINAT* 0 YUCK* 0 ACCLAIM* 1 ZEALOUS* 1
ABRASIV* 0 ZEALOT* 0 ACCLAMATION* 1 ZEST* 1



Litigious Words
ACQUITTANCE DOCKET LEGALIZATIONS QUITCLAIM
ADJOURNING ESCHEATED LEGALLY REBUTS
APPELLANTS EXCEEDENCES LITIGATORS REQUESTER
APPOINTOR EXCULPATED MISTRIALS RESCINDS
ARBITRATE FOREBEAR NOTARIZE STATUTE
ASSERTABLE INASMUCH NOTARIZED SUBPARAGRAPHS
CHATTEL INDEMNITY OBLIGOR SUBPOENAS
CODIFICATIONS INJUNCTION PERSONAM SUBTRUSTS
CONVICTED INTERLOCUTORY PLEADS TENANTABILITY
COUNTERSUIT INTERPLEADER POSTJUDGMENT TESTAMENTARY
DEFEASANCE INTERROGATE PRETRIAL UNENCUMBERED
DELEGATEE IRREVOCABLY PRIMA UNREMEDIATED
DEPOSED LEGALIZATION PROSECUTIONS WHEREOF



Simple dictionary algorithm

• For each word in document:
• +1 if the word is in the positive list
• –1 if the word is in the negative list
• Divide the total by the number of words



26 uses of positive words
– 51 uses of negative words
= –25

–25 / 779 total words
= –0.032
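
The simple dictionary algorithm and the arithmetic above translate directly into code. The sketch below assumes exact, case-sensitive matching against two small made-up word sets; it does not handle the wildcard entries (such as ABANDON*) that appear in the Lexicoder list, and the function name is hypothetical.

```python
def dictionary_score(words, positive, negative):
    """Return (positive hits - negative hits) / total number of words."""
    if not words:
        return 0.0
    total = 0
    for word in words:
        if word in positive:
            total += 1
        if word in negative:
            total -= 1
    return total / len(words)


# Tiny made-up dictionaries and document, for illustration only.
positive = {"good", "great"}
negative = {"poor", "foul"}
words = "a good effort but poor shooting and foul trouble".split()
score = dictionary_score(words, positive, negative)
print(round(score, 3))

# The worked example from the slides: 26 positive and 51 negative
# uses among 779 words.
print(round((26 - 51) / 779, 3))
```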



Words counted as negative
AGAINST          LIMITED
AGGRESSIVENESS   LIMITING
ATTACK           NEGATE
ATTACKING        OFFENSE
CHALLENGE        OFFENSIVE
CONTRAST         OFFENSIVELY
DEFENSIVE        OPPOSING
DEFICIENCIES     PLAGUED
DEVIL            POOR
DEVILS           PROBLEM
DISMAL           SHORTCOMINGS
EXPLOIT          SLUGGISH
FAILED           THORNTON
FOUL             THREATS
FOULING          TOO
FOULS            TROUBLE
FUTILITY         TROUBLES
INABILITY        UNABLE

Words counted as positive
ADEQUATELY       IMPROVEMENT
ADVANTAGE        KEEPING
ASSISTS          LIKE
EFFICIENT        PATRIOT
EFFICIENTLY      PERFECT
EFFORT           RESPONSIBLE
FREE             SIGNIFICANT
FRESHMAN         STRONGER
GOOD             SUCCESS
GREAT            WELL



Supervised machine learning

• The situation:
you know the categories of interest
• The problem:
human coding of documents doesn't scale
• The solution:
teach a robot to do it



Welcome your
robot overlords



Augmenting human capacity



1. Create a training set.
2. Use the training set to "teach" a supervised
learning algorithm how to map document
features (e.g. words) to categories.
3. Test your classifying machine to see if it
learned correctly.
4. Use it to classify the rest of your documents.
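
The four steps can be made concrete with a very small classifier. The sketch below uses multinomial Naive Bayes with add-one smoothing, chosen only because it fits in a few lines; the steps themselves do not prescribe an algorithm. The hand-coded documents and the category names "literal" and "figurative" are invented for the example, and a real training set would need far more documents.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 2: learn per-category word counts from (words, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for words, category in labeled_docs:
        word_counts[category].update(words)
        doc_counts[category] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(words, model):
    """Return the category with the highest log-probability for the words."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Step 1: a (tiny, invented) hand-coded training set.
training_set = [
    ("the mast was straight".split(), "literal"),
    ("the lighthouse was upright".split(), "literal"),
    ("tears like a wax candle".split(), "figurative"),
    ("bending like the sun".split(), "figurative"),
]
model = train(training_set)

# Step 3: check the classifier against hand-coded documents it has not seen.
test_set = [
    ("the waves were regular".split(), "literal"),
    ("like a wax candle".split(), "figurative"),
]
correct = sum(classify(words, model) == category for words, category in test_set)
print(correct, "of", len(test_set), "test documents classified correctly")

# Step 4: classify the remaining, uncoded documents.
print(classify("the mast was upright".split(), model))
```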



Creating a training set

• Create a coding scheme that humans can
use reliably and without ambiguity.
• Select (ideally randomly) a subset of your
documents, and code them by hand.
• You need "enough" documents:
more categories, more documents.



Supervised learning algorithms

• Many kinds:
Naïve Bayes, decision trees / random
forests, support vector machines, neural
networks, etc.
• No "best" one: performance is domain- and
dataset-specific
• "Ensembles" of different algorithms can
often outperform single algorithms



Unsupervised machine learning

• The situation:
you don't know the categories of
interest, or want to discover new ones
• The solution:
have a robot explore and find possible
categorizations for you, and use them to
categorize documents
• Also known as "clustering"


No free lunch

• No need for manual coding beforehand
• But as much or more manual labor
is needed to evaluate suggested
categorizations afterwards
• The value is a novel categorization,
not time or labor saved

http://goo.gl/tFPFs


Two kinds of
unsupervised learning

• Single membership clustering:
each document is assigned to one category
• Mixed membership clustering:
a document may be assigned to multiple
categories, each with a different proportion



Single membership clustering

1. Define a quantitative measure of similarity
between documents.
2. Define a quantitative measure of how
"good" a cluster is.
3. Define a process for optimizing the overall
goodness of the clusters.
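
One way to fill in these three definitions is the k-means procedure, sketched below as an assumption rather than as the method the slides have in mind. Similarity is measured by Euclidean distance between count vectors, a clustering is scored by the total squared distance of documents from their cluster centers (lower is better), and the optimization alternates between assigning documents and moving centers. The vectors and starting centers are invented for the example.

```python
import math


def distance(a, b):
    """Definition 1: Euclidean distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Give each vector the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in vectors]


def scatter(vectors, centroids, labels):
    """Definition 2: total squared distance to assigned centroids."""
    return sum(distance(v, centroids[k]) ** 2 for v, k in zip(vectors, labels))


def k_means(vectors, centroids, max_rounds=100):
    """Definition 3: alternate assignment and centroid updates."""
    labels = assign(vectors, centroids)
    for _ in range(max_rounds):
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # leave an empty cluster's center alone
        centroids = new_centroids
        new_labels = assign(vectors, centroids)
        if new_labels == labels:
            break  # assignments are stable, so stop
        labels = new_labels
    return labels, centroids


# Invented two-dimensional count vectors forming two obvious groups.
vectors = [[2, 1], [2, 2], [1, 2], [9, 8], [8, 9], [9, 9]]
labels, centroids = k_means(vectors, centroids=[vectors[0], vectors[3]])
print(labels)
print(round(scatter(vectors, centroids, labels), 3))
```

The result depends on the starting centers, which is one reason clusterings need the kind of human evaluation described later.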



http://shabal.in/visuals.html


Mixed membership clustering

• Topic modeling is a popular example
• Each document is modeled as a mixture of
categories or topics
• A document is a probability distribution
over topics
• A topic is a probability distribution
over words



Probability distribution



"Generating" text

1. Roll our "topic dice" to choose a topic.
2. Get the "word dice" corresponding to
the chosen topic.
3. Roll the "word dice" to choose a word.
4. Repeat until we've chosen all the words for
our text.
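
The dice-rolling story can be run directly. In the sketch below the "topic dice" are a document's distribution over topics and each set of "word dice" is a topic's distribution over words; all topic names, words and probabilities are invented for the example. This only runs the story forward: fitting a topic model means working backward from observed texts to distributions like these, which this sketch does not attempt.

```python
import random


def generate(topic_weights, topics, length, rng):
    """Generate `length` words by rolling the topic dice, then the word dice."""
    words = []
    topic_names = list(topic_weights)
    for _ in range(length):
        # 1. Roll the "topic dice" to choose a topic.
        topic = rng.choices(
            topic_names, weights=[topic_weights[t] for t in topic_names]
        )[0]
        # 2. Get the "word dice" corresponding to the chosen topic.
        word_weights = topics[topic]
        # 3. Roll the "word dice" to choose a word.
        vocabulary = list(word_weights)
        word = rng.choices(
            vocabulary, weights=[word_weights[w] for w in vocabulary]
        )[0]
        words.append(word)
    # 4. The loop repeats until all the words have been chosen.
    return words


# Invented distributions: one document's topic mixture and two topics.
doc_topics = {"sea": 0.7, "writing": 0.3}
topics = {
    "sea": {"mast": 0.4, "waves": 0.3, "lighthouse": 0.3},
    "writing": {"pen": 0.5, "ink": 0.3, "nib": 0.2},
}

print(" ".join(generate(doc_topics, topics, 12, random.Random(1))))
```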



Topic modeling demo



http://dsl.richmond.edu/dispatch/



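[Figure: text analysis methods arranged on two axes. Vertical axis: simple statistics / computation (bottom) to complex statistics / computation (top). Horizontal axis: weaker domain assumptions (left) to stronger domain assumptions (right). Topic models sit toward the complex end, supervised methods in the middle, and word counting and dictionary methods at the simple end, with dictionary methods to the right of word counting.]

O'Connor, Bamman & Smith 2011
http://goo.gl/PxruI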


Validating results
Keeping the machines from leading you astray



Validating word counts

• Text data may have errors (e.g. from OCR)
• Metadata may have errors
• Texts may appear multiple times
• Collections are biased samples



http://languagelog.ldc.upenn.edu/nll/?p=1701



Validating dictionary methods

• Must verify that dictionary categorizations
match human judgments
• But humans can't reliably "score"
documents on "positivity" or "litigiousness"
• Better to convert scores to simple binaries



Validating supervised methods

• Ideally: take two random non-overlapping
samples and manually code them.
• Use the first sample to train your
supervised learning algorithm.
• Use the second sample to evaluate its
performance.
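
Drawing the two samples is simple to script. The sketch below assumes documents are identified by unique IDs and uses a fixed random seed so the same samples can be drawn again later; the function name, sample size and seed are assumptions made for the example.

```python
import random


def two_samples(doc_ids, sample_size, seed=0):
    """Draw two random, non-overlapping samples of document IDs."""
    doc_ids = list(doc_ids)
    if 2 * sample_size > len(doc_ids):
        raise ValueError("not enough documents for two non-overlapping samples")
    # Sampling without replacement, then cutting the draw in half,
    # guarantees that no ID lands in both samples.
    drawn = random.Random(seed).sample(doc_ids, 2 * sample_size)
    return drawn[:sample_size], drawn[sample_size:]


# Hypothetical collection of 1,000 documents numbered 0 to 999.
train_ids, test_ids = two_samples(range(1000), sample_size=100)
print(len(train_ids), len(test_ids))
```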



262 documents
Rows: human-coded category. Columns: category assigned by the classifier.

             figurative   mixed   literal
figurative       57         32        2
mixed            21         30        6
literal           0          4      110

Accuracy: 197 / 262 = 75%
Precision (figurative category): 57 / 78 = 73%
Recall (figurative category): 57 / 91 = 63%
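
The three figures above can be recomputed from the table. The sketch below assumes that rows hold the human-coded categories and columns hold the classifier's output, which is the reading under which the precision (57 out of a column total of 78) and recall (57 out of a row total of 91) come out as shown.

```python
labels = ["figurative", "mixed", "literal"]

# Assumption: matrix[i][j] counts documents coded as labels[i] by hand
# and assigned to labels[j] by the classifier.
matrix = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]


def accuracy(matrix):
    """Share of all documents on the diagonal (classified correctly)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Of the documents assigned to category k, the share that belong there."""
    assigned = sum(row[k] for row in matrix)
    return matrix[k][k] / assigned


def recall(matrix, k):
    """Of the documents that belong to category k, the share that were found."""
    return matrix[k][k] / sum(matrix[k])


k = labels.index("figurative")
print(f"accuracy  {accuracy(matrix):.0%}")
print(f"precision {precision(matrix, k):.0%}")
print(f"recall    {recall(matrix, k):.0%}")
```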


Validating unsupervised methods

• There are statistical measures of how well
a particular clustering "fits" the data
• These are not appropriate for
evaluating unsupervised clustering of texts
• The "data" is butchered text; we don't
want to fit it well
• Does the categorization make sense?
• Are the categories distinct?
• Are they internally consistent?
• Do they provide insight?



Validating topic coherence

{ dog, cat, horse, apple, pig, cow }

{ car, teacher, platypus, agile, blue, Zaire }
?

Chang et al. 2009
http://goo.gl/FCizP



Validating topic assignment

• Compared to other (manual) categorizations,
how well does this one approximate judgments
of document relatedness?
• Do the categories correlate with external facts?
• Turn the categories into a coding scheme and
apply supervised methods



Managing data
Helping others stand on your shoulders



Three kinds of data

1. The texts you're analyzing and derivations
thereof
2. The software code you're using to process
and analyze your texts
3. Documentation of your process



Textual data

• You want to keep all intermediate versions
of the texts you're processing
• A version control system is ideal for this
• Version control hosting platforms such as
GitHub are ideal for sharing your data too



Software data

• Ideally, use open-source software
• Keep past versions of whatever software
you use
• Use version control for your own scripts
and software



Documentary data

• This is the hardest data to manage
• Consider keeping a (public or private)
"lab notebook" blog
• Anything else you write related to the
project, formal or informal



Long-term preservation

• Data under version control can be exported,
including all versions
• Create static snapshots of websites, blogs, etc.
• Place everything in a long-term digital
repository such as DukeSpace



Take-aways

• Text analysis can be a powerful tool.
• It's a systematic method of transforming
texts to produce new texts for interpretation.
• It only augments human judgment and
interpretation; it can't replace them.
• Be excited by the possibilities
but skeptical of the hype.



Thanks!
http://aesh.in/RC
ryanshaw@unc.edu

