Ryan Shaw (School of Information & Library Science, UNC Chapel Hill) provides an overview and a critique of text-mining projects, and discusses project design, methodology, scope, integrity of data and analysis as well as preservation. This presentation will help scholars understand the research potential of text mining, and offer a summary of issues and concerns about technology and methods.
See also:
http://aesh.in/RC
http://sfy.co/e8ys
Text-mining as a Research Tool in the Humanities and Social Sciences
1. Duke Libraries / Text > Data September 20, 2012
Text-mining
as a Research Tool
in the Humanities and Social Sciences
Ryan Shaw
ryanshaw@unc.edu
http://aesh.in/RC
@rybesh #duketext 1
11. Duke Libraries / Text > Data September 20, 2012
Roberto Busa
@rybesh #duketext 3
13. Duke Libraries / Text > Data September 20, 2012
Automated text analysis
Automated text analysis is a tool for discovery
and measurement in textual data of prevalent
attitudes, concepts, or events.
O'Connor, Bamman & Smith 2011
"Computational Text Analysis for Social Science"
http://goo.gl/PxruI
@rybesh #duketext 4
14. Duke Libraries / Text > Data September 20, 2012
Automated text analysis
Automated text analysis is a tool for discovery
and measurement in textual data of patterns
of language use interpretable as
prevalent attitudes, concepts, or events.
O'Connor, Bamman & Smith 2011
"Computational Text Analysis for Social Science"
http://goo.gl/PxruI
@rybesh #duketext 5
18. Duke Libraries / Text > Data September 20, 2012
Language modeling
• Methods for automated text analysis are
based on mathematical models of language
• Mathematical models distinguish elements
and make explicit the relations among them
• They do not explain, but they can be
interpreted
Black 1962, "Models and Archetypes"
http://goo.gl/zKtrx
@rybesh #duketext 6
22. Duke Libraries / Text > Data September 20, 2012
Language modeling
• All mathematical models of language are
necessarily wrong
• Nevertheless they may be useful
• They must be evaluated on their ability to
help scholars make inferences, achieve
insights, and generate new interpretations
Grimmer & Stewart 2012, "Text as Data"
http://goo.gl/tFPFs
@rybesh #duketext 7
28. Duke Libraries / Text > Data September 20, 2012
Plan of attack
• Acquiring text
• Representing text
• Analyzing text
• Validating results
• Managing data
@rybesh #duketext 8
29. Duke Libraries / Text > Data September 20, 2012
Acquiring text
Collecting your data
@rybesh #duketext 9
34. Duke Libraries / Text > Data September 20, 2012
Sources
• Existing digital corpora
• Other digital sources (e.g. Web, Twitter)
• Undigitized text
@rybesh #duketext 11
39. Duke Libraries / Text > Data September 20, 2012
Existing digital corpora
• Ideally, texts will be available as XML
• Quality of text and metadata is high
• But collections tend to be small
• Licensing agreements may prohibit
text analysis
@rybesh #duketext 12
40. Duke Libraries / Text > Data September 20, 2012
• 10.5 million total volumes
• 5.5 million book titles
• 270,000 serial titles
• 3.2 million public domain
http://www.hathitrust.org/htrc
@rybesh #duketext 13
46. Duke Libraries / Text > Data September 20, 2012
Other digital sources
• Some kinds of texts (e.g. tweets) can be
obtained through an API
• Websites without APIs can be "scraped"
• Generally requires custom programming
• Website restrictions may limit how much
or how quickly texts can be collected
• Metadata will be limited or absent
@rybesh #duketext 14
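A minimal Python sketch of collecting pages slowly enough to respect a site's limits. The function names and the two-second delay are assumptions for illustration, not anything prescribed in the talk; check each site's terms and API limits before collecting.

```python
import time
import urllib.request

def fetch_url(url):
    """Download one page and return its text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def collect(urls, fetch=fetch_url, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages
```

The `fetch` and `sleep` parameters are there so the loop can be tried out without touching the network.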
51. Duke Libraries / Text > Data September 20, 2012
Undigitized text
• Undigitized text must be scanned and
subjected to Optical Character Recognition
• Time and labor intensive
• OCR will introduce errors in your texts
• You need to produce your own metadata
@rybesh #duketext 15
56. Duke Libraries / Text > Data September 20, 2012
Preparing texts
• OCR errors
• Words broken across lines
• Running headers and footers
• Breaking into paragraphs, sentences, etc.
@rybesh #duketext 16
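A rough Python sketch of three of the cleanup steps listed above, assuming plain-text OCR output with one string per page. The function names and the "more than half of the pages" rule are assumptions for illustration; real OCR output usually needs more careful, corpus-specific rules.

```python
import re
from collections import Counter

def join_broken_words(text):
    """Rejoin words hyphenated across a line break, e.g. 'light-\\nhouse'."""
    # Caution: this also joins real hyphenated compounds that break at a line end.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def drop_running_lines(pages, min_fraction=0.5):
    """Drop lines that occur on more than min_fraction of the pages."""
    if len(pages) < 2:
        return list(pages)
    line_pages = Counter()
    for page in pages:
        for line in {l.strip() for l in page.splitlines() if l.strip()}:
            line_pages[line] += 1
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in line_pages.items() if n > threshold}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in repeated)
        for page in pages
    ]

def split_sentences(text):
    """Very rough splitter: break after ., ! or ? followed by whitespace."""
    # Abbreviations such as 'Mr.' will be split incorrectly.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```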
60. Duke Libraries / Text > Data September 20, 2012
Preparing texts
• The bulk of your time will be spent
acquiring and preparing your texts
• Worth your time to learn a scripting
language (such as Python)
• Command-line text-processing tools
on Mac OS and Unix also very useful
@rybesh #duketext 17
61. Duke Libraries / Text > Data September 20, 2012
Representing text
Turning words into numbers
@rybesh #duketext 18
62. Duke Libraries / Text > Data September 20, 2012
Slowly welling from the point of her gold nib,
pale blue ink dissolved the full stop; for there
her pen stuck; her eyes fixed, and tears slowly
filled them. The entire bay quivered; the
lighthouse wobbled; and she had the illusion
that the mast of Mr. Connor's little yacht was
bending like a wax candle in the sun. She
winked quickly. Accidents were awful things.
She winked again. The mast was straight; the
waves were regular; the lighthouse was upright;
but the blot had spread.
@rybesh #duketext 19
63. Duke Libraries / Text > Data September 20, 2012
11 the 1 wax 1 quivered
3 was 1 waves 1 quickly
3 she 1 upright 1 point
3 her 1 things 1 pen
2 winked 1 there 1 pale
2 were 1 them 1 nib
2 slowly 1 that 1 mr
2 of 1 tears 1 little
2 mast 1 sun 1 like
2 lighthouse 1 stuck 1 ink
2 had 1 straight 1 in
2 and 1 stop 1 illusion
1 yacht 1 spread 1 gold
1 wobbled 1 s 1 full
1 welling 1 regular 1 from
@rybesh #duketext 20
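A minimal Python sketch of how counts like these can be produced from the passage on the previous slide. The tokenization rule (lowercase, split on anything that is not a letter) is an assumption, chosen because it is consistent with the "s" and "mr" entries in the list.

```python
import re
from collections import Counter

PASSAGE = """Slowly welling from the point of her gold nib,
pale blue ink dissolved the full stop; for there
her pen stuck; her eyes fixed, and tears slowly
filled them. The entire bay quivered; the
lighthouse wobbled; and she had the illusion
that the mast of Mr. Connor's little yacht was
bending like a wax candle in the sun. She
winked quickly. Accidents were awful things.
She winked again. The mast was straight; the
waves were regular; the lighthouse was upright;
but the blot had spread."""

def count_words(text):
    """Lowercase the text, keep runs of letters as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = count_words(PASSAGE)
for word, n in counts.most_common(5):
    print(n, word)
```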
64. Duke Libraries / Text > Data September 20, 2012
11 the 1 wax 1 quiver
3 wa 1 wave 1 quickli
3 she 1 upright 1 point
3 her 1 thing 1 pen
2 wink 1 there 1 pale
2 were 1 them 1 nib
2 slowli 1 that 1 mr
2 of 1 tear 1 littl
2 mast 1 sun 1 like
2 lighthous 1 stuck 1 ink
2 had 1 straight 1 in
2 and 1 stop 1 illus
1 yacht 1 spread 1 gold
1 wobbl 1 s 1 full
1 well 1 regular 1 from
@rybesh #duketext 21
66. Duke Libraries / Text > Data September 20, 2012
doc 1 doc 2 doc 3 doc 4 doc 5 doc 6
accid 1
actual 1
again 1 1
alreadi 1
antenna 1
archer 1
avoid 2 1
awai 1
aw 1
bag 1
bandanna 1
barfoot 2
@rybesh #duketext 23
70. Duke Libraries / Text > Data September 20, 2012
Document similarity
[Figure: doc 1 and doc 6 plotted against two word axes, "again" and "avoid" (ticks at 1 and 2), with the label "similarity"]
@rybesh #duketext 24
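One common way to turn the document-similarity picture into a number is cosine similarity between word-count vectors. This is an assumption: the slides do not say which measure was used. The two example vectors below are made up for illustration and are not taken from the talk's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors given as dicts."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical counts of the words "again" and "avoid" in two documents.
doc_a = {"again": 1, "avoid": 2}
doc_b = {"again": 1, "avoid": 1}
print(round(cosine_similarity(doc_a, doc_b), 3))
```

A value near 1 means the two documents use these words in similar proportions; 0 means they share none of them.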
71. Duke Libraries / Text > Data September 20, 2012
Analyzing text
Counting, comparing, categorizing and pattern-finding
@rybesh #duketext 25
79. Duke Libraries / Text > Data September 20, 2012
Six methods of text analysis
• Reading
• Counting words
• Human coding (manual content analysis)
• Dictionary methods
• Supervised machine learning
• Unsupervised machine learning
Quinn et al. 2010
http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x
@rybesh #duketext 27
80. Duke Libraries / Text > Data September 20, 2012
Counting words
http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html
@rybesh #duketext 28
82. Duke Libraries / Text > Data September 20, 2012
Michel et al. 2010
http://dx.doi.org/10.1126/science.1199644
@rybesh #duketext 30
88. Duke Libraries / Text > Data September 20, 2012
Counting words
• Easily computed
• Results are replicable
• Comparisons require metadata
e.g. publication year, language,
subject category, location
• Word use is ambiguous
• Spelling may vary
@rybesh #duketext 31
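A small Python sketch of a metadata-based comparison, assuming each document comes with a publication year. The function name and the sample documents are made up for illustration. It matches exact spellings only, so the ambiguity and spelling-variation caveats above still apply.

```python
import re
from collections import Counter

def relative_frequency_by_year(documents, word):
    """documents: iterable of (year, text). Returns {year: share of tokens equal to word}."""
    hits = Counter()
    totals = Counter()
    for year, text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    # Dividing by the yearly total makes years with more text comparable.
    return {year: hits[year] / totals[year] for year in totals if totals[year]}

documents = [
    (1900, "the war and the peace"),
    (1900, "war"),
    (1950, "peace peace"),
]
print(relative_frequency_by_year(documents, "war"))
```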
91. Duke Libraries / Text > Data September 20, 2012
Concordance tools
@rybesh #duketext 34
95. Duke Libraries / Text > Data September 20, 2012
Dictionary methods
• A dictionary is simply a list of words
• Lists are compiled for specific categories of
interest: negative words, law-related words,
names of places, names of chemicals, etc.
• May be custom-built or reused
@rybesh #duketext 35
103. Duke Libraries / Text > Data September 20, 2012
Simple dictionary algorithm
• For each word in document:
• +1 if the word is in the positive list
• –1 if the word is in the negative list
• Divide the total by the number of words
@rybesh #duketext 38
109. Duke Libraries / Text > Data September 20, 2012
26 uses of positive words – 51 uses of negative words = –25
–25 / 779 total words = –0.032
@rybesh #duketext 40
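The simple dictionary algorithm can be sketched in a few lines of Python. The two word lists here are tiny made-up examples, and the tokenization rule is an assumption; a real analysis would use a full dictionary chosen for the domain.

```python
import re

def dictionary_score(text, positive, negative):
    """+1 per positive word, -1 per negative word, divided by the number of words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    total = 0
    for token in tokens:
        if token in positive:
            total += 1
        if token in negative:
            total -= 1
    return total / len(tokens)

# Hypothetical word lists for illustration only.
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"poor", "problem", "foul"}

score = dictionary_score(
    "A great effort, but a poor result and a foul mood.", POSITIVE, NEGATIVE
)

# The worked example from the slides: 26 positive and 51 negative uses in 779 words.
slide_score = (26 - 51) / 779
print(round(slide_score, 3))
```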
110. Duke Libraries / Text > Data September 20, 2012
AGAINST LIMITED
AGGRESSIVENESS LIMITING
ATTACK NEGATE
ATTACKING OFFENSE
CHALLENGE OFFENSIVE ADEQUATELY IMPROVEMENT
CONTRAST OFFENSIVELY ADVANTAGE KEEPING
DEFENSIVE OPPOSING ASSISTS LIKE
DEFICIENCIES PLAGUED EFFICIENT PATRIOT
DEVIL POOR EFFICIENTLY PERFECT
DEVILS PROBLEM EFFORT RESPONSIBLE
DISMAL SHORTCOMINGS FREE SIGNIFICANT
EXPLOIT SLUGGISH FRESHMAN STRONGER
FAILED THORNTON GOOD SUCCESS
FOUL THREATS GREAT WELL
FOULING TOO
FOULS TROUBLE
FUTILITY TROUBLES
INABILITY UNABLE
@rybesh #duketext 41
118. Duke Libraries / Text > Data September 20, 2012
Supervised machine learning
• The situation:
you know the categories of interest
• The problem:
human coding of documents doesn't scale
• The solution:
teach a robot to do it
@rybesh #duketext 44
119. Duke Libraries / Text > Data September 20, 2012
Welcome your
robot overlords
@rybesh #duketext 45
121. Duke Libraries / Text > Data September 20, 2012
Augmenting human capacity
@rybesh #duketext 46
132. Duke Libraries / Text > Data September 20, 2012
Supervised machine learning
1. Create a training set.
2. Use the training set to "teach" a supervised
learning algorithm how to map document
features (e.g. words) to categories.
3. Test your classifying machine to see if it
learned correctly.
4. Use it to classify the rest of your documents.
@rybesh #duketext 48
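The four steps above can be sketched with a small Naive Bayes classifier in plain Python. This is an illustration under stated assumptions, not the method used in the talk: the categories, the example sentences, and all function names are made up, and a real project would use far more training documents and an established library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Step 2: learn word counts per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocabulary.add(token)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocabulary": vocabulary}

def classify(model, text):
    """Step 4: pick the category with the highest log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token not in model["vocabulary"]:
                continue  # ignore words never seen in training
            # Add-one smoothing so an unseen word/category pair is not impossible.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

def accuracy(model, labeled_docs):
    """Step 3: share of held-out documents classified correctly."""
    correct = sum(classify(model, text) == category for text, category in labeled_docs)
    return correct / len(labeled_docs)

# Step 1: a (tiny, hypothetical) hand-coded training set, plus a held-out test set.
training_set = [
    ("the court ruled on the appeal", "law"),
    ("the judge dismissed the lawsuit", "law"),
    ("the team won the game", "sports"),
    ("the coach praised the players", "sports"),
]
test_set = [("the judge ruled", "law"), ("the players won", "sports")]

model = train(training_set)
print(accuracy(model, test_set))
```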
136. Duke Libraries / Text > Data September 20, 2012
Creating a training set
• Create a coding scheme that humans can
use reliably and without ambiguity.
• Select (ideally randomly) a subset of your
documents, and code them by hand.
• You need "enough" documents:
more categories, more documents.
@rybesh #duketext 49
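A short Python sketch of two practical pieces of this step: drawing a random subset to code by hand, and checking how often two coders agree. The function names, the sample size, and the use of simple percent agreement are assumptions for illustration; the slides do not prescribe a particular reliability measure.

```python
import random

def sample_for_coding(doc_ids, n, seed=0):
    """Draw n document ids uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(list(doc_ids), n)

def percent_agreement(codes_a, codes_b):
    """Share of documents that two coders assigned to the same category."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Hypothetical corpus of 1000 documents, 50 of which are coded by hand.
to_code = sample_for_coding(range(1000), 50)
```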
140. Duke Libraries / Text > Data September 20, 2012
Supervised learning algorithms
• Many kinds:
Naïve Bayes, decision trees / random
forests, support vector machines, neural
networks, etc.
• No "best" one: performance is domain- and
dataset-specific
• "Ensembles" of different algorithms can
often outperform single algorithms
@rybesh #duketext 50
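One simple form of ensemble is a majority vote. The Python sketch below assumes each classifier is a function from text to a category label; the three keyword-based classifiers are deliberately crude stand-ins invented for illustration, not real learning algorithms.

```python
from collections import Counter

def majority_vote(classifiers, text):
    """Return the category chosen by the most classifiers."""
    votes = Counter(classify(text) for classify in classifiers)
    # On a tie, the category that received its first vote earliest wins.
    return votes.most_common(1)[0][0]

# Hypothetical stand-in classifiers.
def by_court(text):
    return "law" if "court" in text else "sports"

def by_judge(text):
    return "law" if "judge" in text else "sports"

def by_game(text):
    return "sports" if "game" in text else "law"

ensemble = [by_court, by_judge, by_game]
print(majority_vote(ensemble, "the court postponed the game"))
```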
145. Duke Libraries / Text > Data September 20, 2012
Unsupervised machine learning
• The situation:
you don't know the categories of
interest, or want to discover new ones
• The solution:
have a robot explore and find possible
categorizations for you, and use them to
categorize documents
• Also known as "clustering"
@rybesh #duketext 52
149. Duke Libraries / Text > Data September 20, 2012
No free lunch
• No need for manual coding beforehand
• But as much or more manual labor
is needed to evaluate suggested
categorizations afterwards
• The value is a novel categorization,
not time or labor saved
Grimmer & Stewart 2012, "Text as Data"
http://goo.gl/tFPFs
@rybesh #duketext 53
152. Duke Libraries / Text > Data September 20, 2012
Two kinds of
unsupervised learning
• Single membership clustering:
each document is assigned to one category
• Mixed membership clustering:
a document may be assigned to multiple
categories, each with a different proportion
@rybesh #duketext 54
156. Duke Libraries / Text > Data September 20, 2012
Single membership clustering
1. Define a quantitative measure of similarity
between documents.
2. Define a quantitative measure of how
"good" a cluster is.
3. Define a process for optimizing the overall
goodness of the clusters.
@rybesh #duketext 55
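k-means is one well-known algorithm that fits these three steps, and the Python sketch below uses it as an example; the slides do not name a specific algorithm, so this choice is an assumption. Similarity is measured by squared Euclidean distance (smaller means more similar), cluster quality by the within-cluster sum of squares, and the optimization alternates between assigning points and moving centers. The four points are made-up word-count vectors.

```python
def squared_distance(a, b):
    """Step 1: (dis)similarity between two documents as vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_ss(centers, clusters):
    """Step 2: lower total distance to the cluster centers means better clusters."""
    return sum(
        squared_distance(p, center)
        for center, cluster in zip(centers, clusters)
        for p in cluster
    )

def k_means(points, k, iterations=20):
    """Step 3: alternate between assigning points and updating centers."""
    centers = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers, clusters

# Hypothetical documents as (count of "again", count of "avoid").
points = [(1, 0), (2, 0), (0, 1), (0, 2)]
centers, clusters = k_means(points, k=2)
print(clusters, within_cluster_ss(centers, clusters))
```

The result depends on the starting centers, so in practice the procedure is usually run several times from different starting points.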
159. Duke Libraries / Text > Data September 20, 2012
http://shabal.in/visuals.html
@rybesh #duketext 57
165. Duke Libraries / Text > Data September 20, 2012
Mixed membership clustering
• Topic modeling is a popular example
• Each document is modeled as a mixture of
categories or topics
• A document is a probability distribution
over topics
• A topic is a probability distribution
over words
@rybesh #duketext 58
166. Duke Libraries / Text > Data September 20, 2012
Probability distribution
@rybesh #duketext 59
171. Duke Libraries / Text > Data September 20, 2012
"Generating" text
1. Roll our "topic dice" to choose a topic.
2. Get the "word dice" corresponding to the
chosen topic.
3. Roll the "word dice" to choose a word.
4. Repeat until we've chosen all the words for
our text.
@rybesh #duketext 60
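The dice-rolling story above can be written as a few lines of Python. The topics, words, and probabilities are invented for illustration, and the function name is hypothetical. Note that this only shows the "generating" direction; fitting a topic model means working backwards from real texts to estimate the dice, which is not shown here.

```python
import random

def generate_text(topic_weights, topic_word_weights, length, seed=0):
    """Generate `length` words by rolling topic dice, then word dice."""
    rng = random.Random(seed)
    topics = list(topic_weights)
    words = []
    for _ in range(length):
        # 1. Roll the "topic dice" for this document.
        topic = rng.choices(topics, weights=[topic_weights[t] for t in topics])[0]
        # 2. Get the "word dice" for the chosen topic.
        word_weights = topic_word_weights[topic]
        vocab = list(word_weights)
        # 3. Roll the "word dice" to choose a word.
        word = rng.choices(vocab, weights=[word_weights[w] for w in vocab])[0]
        words.append(word)
    return words  # 4. Repeated until all the words are chosen.

# Hypothetical document: 70% "sea" topic, 30% "law" topic.
topic_weights = {"sea": 0.7, "law": 0.3}
topic_word_weights = {
    "sea": {"mast": 0.5, "waves": 0.3, "lighthouse": 0.2},
    "law": {"court": 0.6, "judge": 0.4},
}
text = generate_text(topic_weights, topic_word_weights, length=10)
print(" ".join(text))
```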
172. Duke Libraries / Text > Data September 20, 2012
Topic modeling demo
@rybesh #duketext 61
173. Duke Libraries / Text > Data September 20, 2012
http://dsl.richmond.edu/dispatch/
@rybesh #duketext 62
174. Duke Libraries / Text > Data September 20, 2012
[Figure: text-analysis methods arranged on two axes. Vertical axis: simple to complex statistics / computation. Horizontal axis: weaker to stronger domain assumptions. Methods plotted: word counting, dictionary methods, supervised methods, topic models.]
O'Connor, Bamman & Smith 2011
http://goo.gl/PxruI
@rybesh #duketext 63
175. Duke Libraries / Text > Data September 20, 2012
Validating results
Keeping the machines from leading you astray
@rybesh #duketext 64
180. Duke Libraries / Text > Data September 20, 2012
Validating word counts
• Text data may have errors (e.g. from OCR)
• Metadata may have errors
• Texts may appear multiple times
• Collections are biased samples
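One of these checks, finding texts that appear multiple times, can be sketched as follows. This is a minimal illustration, not a method from the talk: `fingerprint` and `find_duplicates` are hypothetical names, the toy corpus is invented, and the approach only catches copies that become identical after lowercasing and stripping punctuation and spacing.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text after lowercasing and dropping punctuation and extra spaces."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(corpus):
    """Return lists of document ids whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        groups[fingerprint(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical toy corpus: "b" is a reprint of "a" with different spacing.
corpus = {
    "a": "The quick brown fox.",
    "b": "the  quick brown   fox",
    "c": "An entirely different text.",
}
print(find_duplicates(corpus))
```

Copies that differ because of OCR errors will not hash to the same value, so near-duplicate detection would need a fuzzier comparison.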
http://languagelog.ldc.upenn.edu/nll/?p=1701
Validating dictionary methods
• Must verify that dictionary categorizations
match human judgments
• But humans can't reliably "score"
documents on "positivity" or "litigiousness"
• Better to convert scores to simple binaries
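The last bullet can be sketched like this. The scores, the human codes and the threshold of 0.0 are all made-up assumptions, and `binarize` and `agreement` are hypothetical helper names.

```python
def binarize(score, threshold=0.0):
    """Collapse a continuous dictionary score into a yes/no label."""
    return score > threshold


def agreement(scores, human_labels, threshold=0.0):
    """Fraction of documents where the binarized score matches the human label."""
    matches = sum(
        binarize(s, threshold) == h for s, h in zip(scores, human_labels)
    )
    return matches / len(scores)


# Hypothetical data: dictionary "positivity" scores for five documents,
# and a human coder's answer to "is this document positive?"
scores = [0.8, -0.3, 0.1, -0.9, 0.4]
human = [True, False, False, False, True]
print(agreement(scores, human))
```

Asking coders a yes/no question is easier to do reliably than asking for a numeric score, which is why the comparison is made on the binary labels.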
Validating supervised methods
• Ideally: take two random non-overlapping
samples and manually code them.
• Use the first sample to train your
supervised learning algorithm.
• Use the second sample to evaluate its
performance.
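Drawing the two samples might look like the sketch below. The corpus size, sample sizes and the function name `split_coded_sample` are assumptions for illustration; the manual coding and the learning algorithm itself are outside the sketch.

```python
import random


def split_coded_sample(doc_ids, n_train, n_test, seed=0):
    """Draw two random, non-overlapping samples of document ids."""
    if n_train + n_test > len(doc_ids):
        raise ValueError("not enough documents for both samples")
    rng = random.Random(seed)
    # Sampling without replacement guarantees the two halves cannot overlap.
    drawn = rng.sample(doc_ids, n_train + n_test)
    return drawn[:n_train], drawn[n_train:]


doc_ids = list(range(1000))  # hypothetical corpus of 1,000 documents
train_ids, test_ids = split_coded_sample(doc_ids, n_train=200, n_test=100)
print(len(train_ids), len(test_ids))
```

Both samples are then coded by hand; only the first is shown to the learning algorithm, so performance on the second is an honest estimate.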
Results for three categories, 262 documents
(the precision and recall figures below imply that rows are the manual codes and columns are the predicted categories)
            figurative  mixed  literal
figurative          57     32        2
mixed               21     30        6
literal              0      4      110
Accuracy: 197 / 262 = 75%
Precision, figurative category: 57 / 78 = 73%
Recall, figurative category: 57 / 91 = 63%
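The accuracy, precision and recall figures above can be recomputed from the 3 x 3 table. In this sketch, reading rows as the manual codes and columns as the classifier's output is an assumption, chosen because it reproduces the 57 / 78 and 57 / 91 figures.

```python
labels = ["figurative", "mixed", "literal"]

# Rows and columns follow the order of `labels`.
# Assumption: rows are manual codes, columns are predicted categories.
matrix = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total


def precision(i):
    """Correct predictions of category i over all predictions of category i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


def recall(i):
    """Correct predictions of category i over all documents coded as category i."""
    return matrix[i][i] / sum(matrix[i])


fig = labels.index("figurative")
print(f"accuracy:  {correct} / {total} = {accuracy:.0%}")
print(f"precision: {precision(fig):.0%}")
print(f"recall:    {recall(fig):.0%}")
```

Accuracy alone hides the weak spot: most errors are confusions between the figurative and mixed categories, which the per-category figures expose.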
Validating unsupervised methods
• There are statistical measures of how well
a particular clustering "fits" the data
• These are not appropriate for
evaluating unsupervised clustering of texts
• The "data" is butchered text; we don't
want to fit it well
Validating unsupervised methods
• Does the categorization make sense?
• Are the categories distinct?
• Are they internally consistent?
• Do they provide insight?
Validating topic coherence
{ dog, cat, horse, apple, pig, cow }
{ car, teacher, platypus, agile, blue, Zaire }
?
Chang et al. 2009
http://goo.gl/FCizP
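The word sets above look like items from the word-intrusion test described by Chang et al. 2009: people see a topic's top words plus one word from elsewhere, and pick the one that does not belong. A rough sketch of building and scoring such an item follows; the annotator responses are invented, and `make_intrusion_item` and `model_precision` are hypothetical names.

```python
import random


def make_intrusion_item(topic_words, intruder, rng):
    """Mix one intruder word into a topic's top words and shuffle the result."""
    words = list(topic_words) + [intruder]
    rng.shuffle(words)
    return words


def model_precision(picks, intruder):
    """Fraction of human picks that identify the true intruder."""
    return sum(p == intruder for p in picks) / len(picks)


rng = random.Random(0)
item = make_intrusion_item(["dog", "cat", "horse", "pig", "cow"], "apple", rng)

# Hypothetical responses from four annotators shown the shuffled item.
picks = ["apple", "apple", "apple", "pig"]
print(item)
print(model_precision(picks, "apple"))
```

For a coherent topic nearly everyone spots the intruder; for a set like the second one above, picks scatter and the score drops toward chance.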
Validating topic assignment
Validating unsupervised methods
• Compared to other (manual) categorizations,
how well does this one approximate judgments
of document relatedness?
• Do the categories correlate with external facts?
• Turn the categories into a coding scheme and
apply supervised methods
Managing data
Helping others stand on your shoulders
Three kinds of data
1. The texts you're analyzing and derivations
thereof
2. The software code you're using to process
and analyze your texts
3. Documentation of your process
Textual data
• You want to keep all intermediate versions
of the texts you're processing
• A version control system is ideal for this
• Version control hosting platforms such as
GitHub are ideal for sharing your data too
Software data
• Ideally, use open-source software
• Keep past versions of whatever software
you use
• Use version control for your own scripts
and software
Documentary data
• This is the hardest data to manage
• Consider keeping a (public or private)
"lab notebook" blog
• It also includes anything else you write
related to the project, formal or informal
Long-term preservation
• Data under version control can be exported,
including all versions
• Create static snapshots of websites, blogs, etc.
• Place everything in a long-term digital
repository such as DukeSpace
Take-aways
• Text analysis can be a powerful tool.
• It's a systematic method of transforming
texts to produce new texts for interpretation.
• It only augments human judgment and
interpretation; it can't replace them.
• Be excited by the possibilities
but skeptical of the hype.
Thanks!
http://aesh.in/RC
ryanshaw@unc.edu
Editor's Notes
1949 - persuaded IBM to sponsor his project to produce a complete concordance of the works of St. Thomas Aquinas
30 years
Not new -- what's new is that it has become affordable, in both money and time
title of this workshop mentions "text mining", i prefer
through a process of abstraction...
We computed a “suppression index” for each person by dividing their frequency from 1933-1945 by the mean frequency in 1925-1933 and in 1955-1965.
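A rough sketch of that calculation, under several assumptions not stated in the note: frequencies are available per year, the 1933-1945 frequency is taken as the mean over those years, and the two reference periods are pooled into a single mean. The function name `suppression_index` and the toy data are hypothetical.

```python
from statistics import mean


def suppression_index(freq_by_year):
    """Mean frequency in 1933-1945 divided by the mean over the reference years."""
    during = mean(freq_by_year[y] for y in range(1933, 1946))
    # Assumption: both reference periods are pooled; 1933 is included in the
    # reference years because the note gives the first period as 1925-1933.
    reference_years = list(range(1925, 1934)) + list(range(1955, 1966))
    reference = mean(freq_by_year[y] for y in reference_years)
    return during / reference


# Hypothetical person whose mentions drop sharply during 1933-1945.
freq = {year: 1.0 for year in range(1925, 1966)}
for year in range(1933, 1946):
    freq[year] = 0.25
print(suppression_index(freq))
```

Values well below 1 mean a person was mentioned much less often in 1933-1945 than in the surrounding periods; values near 1 mean no change.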
designed to capture the sentiment of political texts