Duke Libraries / Text > Data                                           September 20, 2012




                                   Text-mining
                               as a Research Tool
                               in the Humanities and Social Sciences



                                           Ryan Shaw
                                       ryanshaw@unc.edu
                                        http://aesh.in/RC

@rybesh #duketext                                                                      1




                               Roberto Busa








                         Automated text analysis


           Automated text analysis is a tool for discovery
           and measurement in textual data of prevalent
           attitudes, concepts, or events.


                                            O'Connor, Bamman & Smith 2011
                                "Computational Text Analysis for Social Science"
                                                              http://goo.gl/PxruI







                         Automated text analysis

           Automated text analysis is a tool for discovery
           and measurement in textual data of patterns
           of language use interpretable as
           prevalent attitudes, concepts, or events.

                                            O'Connor, Bamman & Smith 2011
                                "Computational Text Analysis for Social Science"
                                                              http://goo.gl/PxruI







                               Language modeling

               • Methods for automated text analysis are
                     based on mathematical models of language
               • Mathematical models distinguish elements
                     and make explicit the relations among them
               • They do not explain, but they can be
                     interpreted

                                         Black 1962, "Models and Archetypes"
                                                         http://goo.gl/zKtrx




                               Language modeling

               • All mathematical models of language are
                     necessarily wrong

               • Nevertheless they may be useful
               • They must be evaluated on their ability to
                     help scholars make inferences, achieve
                     insights, and generate new interpretations

                                         Grimmer & Stewart 2012, "Text as Data"
                                                             http://goo.gl/tFPFs




                               Plan of attack

               • Acquiring text
               • Representing text
               • Analyzing text
               • Validating results
               • Managing data




                               Acquiring text
                                 Collecting your data








                               Sources


               • Existing digital corpora
               • Other digital sources (e.g. the Web, Twitter)
               • Undigitized text





                           Existing digital corpora

               • Ideally, texts will be available as XML
               • Quality of text and metadata is high
               • But collections tend to be small
               • Licensing agreements may prohibit
                     text analysis
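
One reason XML is convenient is that text and metadata can be pulled apart with a few lines of code. The following is a minimal sketch, assuming a hypothetical file layout in which metadata sits under a <header> element and each paragraph is a <p> element. The element names and the sample document are invented for illustration; a real corpus defines its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real corpora define their own schemas.
SAMPLE_XML = """<text>
  <header><title>Sample Title</title><date>1900</date></header>
  <body>
    <p>First paragraph of the text.</p>
    <p>Second paragraph, with <hi>markup</hi> inside.</p>
  </body>
</text>"""


def extract(xml_string):
    """Return (metadata dict, list of paragraph strings)."""
    root = ET.fromstring(xml_string)
    metadata = {
        "title": root.findtext("header/title"),
        "date": root.findtext("header/date"),
    }
    # itertext() also gathers text from nested elements such as <hi>.
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return metadata, paragraphs


metadata, paragraphs = extract(SAMPLE_XML)
```

Because the metadata arrives alongside the text, later comparisons (by date, author, genre) need no extra data entry.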






                               HathiTrust

               • 10.5 million total volumes
               • 5.5 million book titles
               • 270,000 serial titles
               • 3.2 million public domain volumes

                               http://www.hathitrust.org/htrc




                               Other digital sources
               • Some kinds of texts (e.g. tweets) can be
                     obtained through an API
               • Websites without APIs can be "scraped"
               • Generally requires custom programming
               • Website restrictions may limit how much
                     or how quickly texts can be collected
               • Metadata will be limited or absent
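
The "custom programming" for scraping often amounts to two things: stripping markup from each page and pausing between requests. Below is a minimal sketch using only the Python standard library. The function names (html_to_text, collect) and the one-second default delay are assumptions for illustration, and the `fetch` argument stands in for whatever download function a project actually uses.

```python
import time
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def collect(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any function mapping a URL to an HTML string (for
    example a thin wrapper around urllib.request.urlopen). Passing it
    in keeps this sketch independent of any particular site.
    """
    texts = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # stay within the site's limits
        texts[url] = html_to_text(fetch(url))
    return texts
```

The appropriate delay depends on the site's terms of use and any published rate limits, so check those before collecting at scale.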




                               Undigitized text

               • Undigitized text must be scanned and
                     subjected to Optical Character Recognition
               • Time and labor intensive
               • OCR will introduce errors in your texts
               • You need to produce your own metadata





                               Preparing texts


               • OCR errors
               • Words broken across lines
               • Running headers and footers
               • Breaking into paragraphs, sentences, etc.
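
Three of the clean-up steps listed above can be sketched with regular expressions. This is a rough illustration, not a complete pipeline: the function names, the "appears on at least half the pages" rule for running headers, and the blank-line definition of a paragraph are all assumptions made for the example.

```python
import re
from collections import Counter


def drop_running_lines(pages, min_share=0.5):
    """Remove lines that recur on many pages (running headers/footers).

    Digits are masked so that "Page 12" and "Page 13" count as the
    same line. A line must appear on at least two pages to be dropped.
    """
    def key(line):
        return re.sub(r"\d+", "#", line.strip())

    seen = Counter()
    for page in pages:
        seen.update({key(l) for l in page.splitlines() if l.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in seen.items() if n >= threshold}
    return ["\n".join(l for l in page.splitlines() if key(l) not in repeated)
            for page in pages]


def join_broken_words(text):
    """Rejoin words hyphenated across a line break."""
    return re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)


def split_paragraphs(text):
    """Split on blank lines and collapse internal whitespace."""
    return [" ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Note that join_broken_words also merges genuine compounds that happen to break at the hyphen ("well-" / "known" becomes "wellknown"), so spot-check its output against the page images.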





                               Preparing texts

            • The bulk of your time will be spent
                  acquiring and preparing your texts
            • Worth your time to learn a scripting
                  language (such as Python)
            • Command-line text-processing tools
                  on Mac OS and Unix are also very useful






                          Representing text
                               Turning words into numbers








               Slowly welling from the point of her gold nib,
               pale blue ink dissolved the full stop; for there
               her pen stuck; her eyes fixed, and tears slowly
               filled them. The entire bay quivered; the
               lighthouse wobbled; and she had the illusion
               that the mast of Mr. Connor's little yacht was
               bending like a wax candle in the sun. She
               winked quickly. Accidents were awful things.
               She winked again. The mast was straight; the
               waves were regular; the lighthouse was upright;
               but the blot had spread.





            11       the          1   wax        1   quivered
             3       was          1   waves      1   quickly
             3       she          1   upright    1   point
             3       her          1   things     1   pen
             2       winked       1   there      1   pale
             2       were         1   them       1   nib
             2       slowly       1   that       1   mr
             2       of           1   tears      1   little
             2       mast         1   sun        1   like
             2       lighthouse   1   stuck      1   ink
             2       had          1   straight   1   in
             2       and          1   stop       1   illusion
             1       yacht        1   spread     1   gold
             1       wobbled      1   s          1   full
             1       welling      1   regular    1   from
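
Counts like those in the table above can be produced in a few lines. This is a minimal sketch, assuming a tokenizer that lowercases the text and keeps only runs of the letters a to z; that choice is an assumption, but it is consistent with the table, where "Connor's" contributes a separate token "s".

```python
import re
from collections import Counter

PASSAGE = """Slowly welling from the point of her gold nib,
pale blue ink dissolved the full stop; for there
her pen stuck; her eyes fixed, and tears slowly
filled them. The entire bay quivered; the
lighthouse wobbled; and she had the illusion
that the mast of Mr. Connor's little yacht was
bending like a wax candle in the sun. She
winked quickly. Accidents were awful things.
She winked again. The mast was straight; the
waves were regular; the lighthouse was upright;
but the blot had spread."""


def tokenize(text):
    """Lowercase the text and return runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


counts = Counter(tokenize(PASSAGE))

# Print the five most frequent tokens with their counts.
for word, n in counts.most_common(5):
    print(n, word)
```

Word order, punctuation and capitalization are all discarded at this step, which is exactly the simplification the "bag of words" representation makes.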




            11       the         1   wax        1   quiver
             3       wa          1   wave       1   quickli
             3       she         1   upright    1   point
             3       her         1   thing      1   pen
             2       wink        1   there      1   pale
             2       were        1   them       1   nib
             2       slowli      1   that       1   mr
             2       of          1   tear       1   littl
             2       mast        1   sun        1   like
             2       lighthous   1   stuck      1   ink
             2       had         1   straight   1   in
             2       and         1   stop       1   illus
             1       yacht       1   spread     1   gold
             1       wobbl       1   s          1   full
             1       well        1   regular    1   from
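
The slide does not say which stemmer produced the table above, but stems such as "wa", "quickli" and "lighthous" look like the output of the Porter stemmer. As a partial illustration, here is a sketch of only the first step (step 1a) of Porter's 1980 algorithm, which handles plural endings. It explains "was" becoming "wa" and "waves" becoming "wave", but it does not reproduce later steps such as "quickly" becoming "quickli". Leaving tokens of one or two letters untouched follows common implementations and is an assumption here.

```python
def porter_step_1a(word):
    """Apply only step 1a of Porter's algorithm (plural endings)."""
    if len(word) <= 2:          # very short tokens are left alone
        return word
    if word.endswith("sses"):
        return word[:-2]        # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]        # ponies -> poni
    if word.endswith("ss"):
        return word             # caress -> caress
    if word.endswith("s"):
        return word[:-1]        # waves -> wave, was -> wa
    return word


stems = [porter_step_1a(w) for w in ["was", "waves", "things", "tears", "mast"]]
```

The rule is purely mechanical: it strips the final "s" from "was" just as it does from "waves", which is why stems are not always real words.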


                               doc 1 doc 2 doc 3 doc 4 doc 5 doc 6
       accid                    1
      actual                                             1
       again                    1                              1
     alreadi                          1
     antenna                                             1
      archer                          1
       avoid                    2                              1
        awai                                                   1
          aw                    1
         bag                                       1
    bandanna                                                   1
     barfoot                                2
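
A table like the one above can be assembled from per-document counts. The sketch below is one simple way to do it, assuming each document has already been tokenized and stemmed; the function name is hypothetical. Unlike the slide, it puts documents in rows and terms in columns, and it writes explicit zeros where the slide leaves cells blank.

```python
from collections import Counter


def document_term_matrix(docs):
    """Build a document-term matrix from a list of token lists.

    Returns (vocabulary, rows): vocabulary is the sorted list of
    distinct terms, and rows[i][j] is the count of vocabulary[j]
    in document i.
    """
    counts = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Two toy documents using stems that appear in the table.
vocabulary, rows = document_term_matrix([
    ["avoid", "again", "avoid"],
    ["again", "avoid"],
])
```

For real collections most cells are zero, so a sparse representation is usually preferable to nested lists.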




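                               Document similarity

               [Figure: doc 1 and doc 6 plotted against two term axes,
                "avoid" (horizontal) and "again" (vertical), each with
                ticks at 1 and 2. The relation between the two documents
                is labeled "similarity".]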
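
The slide does not name a similarity measure. One common choice for comparing documents represented as count vectors is cosine similarity, the cosine of the angle between the two vectors. The sketch below applies it to doc 1 and doc 6, using only their counts for "avoid" and "again" from the document-term matrix; restricting the comparison to these two terms is a simplification for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


doc1 = [2, 1]  # counts of (avoid, again) in doc 1
doc6 = [1, 1]  # counts of (avoid, again) in doc 6

similarity = cosine_similarity(doc1, doc6)
```

Here the value works out to 3 / sqrt(10), roughly 0.95. Because the measure depends on direction rather than length, a long document and a short one with the same proportions of words count as identical.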




                               Analyzing text
               Counting, comparing, categorizing and pattern-finding








                  Six methods of text analysis
               • Reading
               • Counting words
               • Human coding (manual content analysis)
               • Dictionary methods
               • Supervised machine learning
               • Unsupervised machine learning
                                                                 Quinn et al. 2010
                               http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x



                                     Counting words




                               http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html

                                                                Michel et al. 2010
                                        http://dx.doi.org/10.1126/science.1199644

                                   Counting words
                               • Easily computed
                               • Results are replicable
                               • Comparisons require metadata
                                 e.g. publication year, language,
                                 subject category, location
                               • Word use is ambiguous
                               • Spelling may vary
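
The points above can be sketched in a few lines of Python. This is only an illustration, not the method behind any study cited in these slides; the two-document `corpus`, its `year` field, and the helper names `tokenize` and `counts_by_year` are invented for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def counts_by_year(corpus):
    """Count words separately for each publication year (the metadata)."""
    counts = defaultdict(Counter)
    for doc in corpus:
        counts[doc["year"]].update(tokenize(doc["text"]))
    return counts

# Hypothetical corpus; real comparisons need real metadata.
corpus = [
    {"year": 1900, "text": "The colour of the bank."},
    {"year": 2000, "text": "The color of the river bank."},
]

counts = counts_by_year(corpus)
print(counts[1900]["the"], counts[2000]["the"])
print(counts[1900]["colour"], counts[2000]["colour"])
print(counts[1900]["bank"], counts[2000]["bank"])
```

Counting is easy and repeatable, but the sketch also shows the two caveats: "colour" and "color" are counted as different words, and "bank" gets the same count whether it means a riverbank or a financial institution.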

                               Concordance tools




                               Dictionary methods


               • A dictionary is simply a list of words
               • Lists are compiled for specific categories of
                     interest: negative words, law-related words,
                     names of places, names of chemicals, etc.
               • May be custom-built or reused

            Lexicoder Sentiment Dictionary
A LIE 0         WOUNDED 0       ABILITY* 1        WOOS 1
ABANDON* 0      WOUNDS 0        ABOUND* 1         WORKABLE* 1
ABAS* 0         WRATH* 0        ABSOLV* 1         WORKMANSHIP* 1
ABATTOIR* 0     WRECK* 0        ABSORBENT* 1      WORSHIP* 1
ABDICAT* 0      WRESTL* 0       ABSORPTION* 1     WORTH 1
ABERRA* 0       WRETCH* 0       ABUNDANC* 1       WORTH WHILE* 1
ABHOR* 0        WRITHE* 0       ABUNDANT* 1       WORTHI* 1
ABJECT* 0       WRONG* 0        ACCED* 1          WORTHWHILE* 1
ABNORMAL* 0     XENOPHOB* 0     ACCENTUAT* 1      WORTHY* 1
ABOLISH* 0      YAWN* 0         ACCEPT* 1         YOUNG AT HEART 1
ABOMINAB* 0     YEARN* 0        ACCESSIB* 1       ZEAL 1
ABOMINAT* 0     YUCK* 0         ACCLAIM* 1        ZEALOUS* 1
ABRASIV* 0      ZEALOT* 0       ACCLAMATION* 1    ZEST* 1

                               Litigious Words
ACQUITTANCE     DOCKET          LEGALIZATIONS   QUITCLAIM
ADJOURNING      ESCHEATED       LEGALLY         REBUTS
APPELLANTS      EXCEEDENCES     LITIGATORS      REQUESTER
APPOINTOR       EXCULPATED      MISTRIALS       RESCINDS
ARBITRATE       FOREBEAR        NOTARIZE        STATUTE
ASSERTABLE      INASMUCH        NOTARIZED       SUBPARAGRAPHS
CHATTEL         INDEMNITY       OBLIGOR         SUBPOENAS
CODIFICATIONS   INJUNCTION      PERSONAM        SUBTRUSTS
CONVICTED       INTERLOCUTORY   PLEADS          TENANTABILITY
COUNTERSUIT     INTERPLEADER    POSTJUDGMENT    TESTAMENTARY
DEFEASANCE      INTERROGATE     PRETRIAL        UNENCUMBERED
DELEGATEE       IRREVOCABLY     PRIMA           UNREMEDIATED
DEPOSED         LEGALIZATION    PROSECUTIONS    WHEREOF

                   Simple dictionary algorithm

               • For each word in document:
                • +1 if the word is in the positive list
                • –1 if the word is in the negative list
               • Divide the total by the number of words
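
The steps above can be sketched in Python as follows. This is a minimal version under stated assumptions: the function name `dictionary_score`, the crude tokenizer, and the two tiny word lists are all invented for illustration and are not taken from any published dictionary.

```python
import re

def dictionary_score(text, positive, negative):
    """Return (positive hits - negative hits) / total words, or 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    total = 0
    for word in words:
        if word in positive:
            total += 1   # +1 if the word is in the positive list
        if word in negative:
            total -= 1   # -1 if the word is in the negative list
    return total / len(words)

# Hypothetical mini-dictionaries for illustration only.
positive = {"good", "great", "success"}
negative = {"poor", "failed", "trouble"}

score = dictionary_score(
    "A great effort, but the defense failed and looked poor.", positive, negative
)
print(round(score, 3))
```

The sketch matches whole words only. Dictionary entries that end in `*` look like stem patterns, so handling them would likely need prefix matching instead of the exact set lookup used here.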

                               26 uses of positive words
                                          –
                               51 uses of negative words
                                          =
                                         –25

                                 –25 / 779 total words
                                          =
                                        –0.032

        AGAINST                LIMITED
        AGGRESSIVENESS         LIMITING
        ATTACK                 NEGATE
        ATTACKING              OFFENSE
        CHALLENGE              OFFENSIVE      ADEQUATELY    IMPROVEMENT
        CONTRAST               OFFENSIVELY    ADVANTAGE     KEEPING
        DEFENSIVE              OPPOSING       ASSISTS       LIKE
        DEFICIENCIES           PLAGUED        EFFICIENT     PATRIOT
        DEVIL                  POOR           EFFICIENTLY   PERFECT
        DEVILS                 PROBLEM        EFFORT        RESPONSIBLE
        DISMAL                 SHORTCOMINGS   FREE          SIGNIFICANT
        EXPLOIT                SLUGGISH       FRESHMAN      STRONGER
        FAILED                 THORNTON       GOOD          SUCCESS
        FOUL                   THREATS        GREAT         WELL
        FOULING                TOO
        FOULS                  TROUBLE
        FUTILITY               TROUBLES
        INABILITY              UNABLE




                  Supervised machine learning

               • The situation:
                     you know the categories of interest

               • The problem:
                     human coding of documents doesn't scale
               • The solution:
                     teach a robot to do it


          Welcome your
         robot overlords




                  Augmenting human capacity




                  Supervised machine learning
               1. Create a training set.
               2. Use the training set to "teach" a supervised
                  learning algorithm how to map document
                  features (e.g. words) to categories.
               3. Test your classifying machine to see if it
                  learned correctly.
               4. Use it to classify the rest of your documents.
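
The four steps can be walked through with a toy example. The sketch below uses a small hand-written Naïve Bayes classifier with add-one smoothing, which is only one of many possible algorithms; the function names `train` and `classify`, the categories, and all example documents are hypothetical.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Step 2: learn per-category word counts from (words, category) pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for words, category in labeled_docs:
        word_counts.setdefault(category, Counter()).update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(words, model):
    """Return the category with the highest log-probability for these words."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for w in words:
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((word_counts[category][w] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

# Step 1: a hypothetical hand-coded training set.
training = [
    ("court ruled on the statute".split(), "law"),
    ("judge issued an injunction".split(), "law"),
    ("team won the game".split(), "sports"),
    ("coach praised the team".split(), "sports"),
]
# Hand-coded documents held back for step 3.
held_out = [
    ("the judge cited the statute".split(), "law"),
    ("the team lost the game".split(), "sports"),
]

model = train(training)

# Step 3: compare the machine's answers with the human codes.
correct = sum(classify(words, model) == label for words, label in held_out)
accuracy = correct / len(held_out)
print("held-out accuracy:", accuracy)

# Step 4: classify a document nobody has coded.
prediction = classify("the court issued a ruling".split(), model)
print(prediction)
```

Real projects need far more training and test documents than this, and the held-out documents must never be used for training.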

                               Creating a training set

               • Create a coding scheme that humans can
                     use reliably and without ambiguity.
               • Select (ideally randomly) a subset of your
                     documents, and code them by hand.
               • You need "enough" documents:
                     more categories, more documents.
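
Random selection is simple to do reproducibly. A minimal sketch, assuming documents are identified by integer ids; the function name `sample_for_coding`, the corpus size, and the sample size are made up for the example and say nothing about how many documents are "enough".

```python
import random

def sample_for_coding(doc_ids, n, seed=0):
    """Draw n document ids uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn later
    return rng.sample(doc_ids, n)

doc_ids = list(range(1000))               # hypothetical corpus of 1,000 documents
to_code = sample_for_coding(doc_ids, 50)  # the subset to code by hand
print(len(to_code), len(set(to_code)))
```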


              Supervised learning algorithms
               • Many kinds:
                     Naïve Bayes, decision trees / random
                     forests, support vector machines, neural
                     networks, etc.
               • No "best" one: performance is domain- and
                     dataset-specific
               • "Ensembles" of different algorithms can
                     often outperform single algorithms
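
One simple way to build an ensemble is majority voting, sketched below. The slides do not say which combination rule is meant, so treat this as one possible reading; the three prediction lists are invented stand-ins for the outputs of three separately trained classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label that the most classifiers predicted."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three different classifiers on four documents.
naive_bayes   = ["law", "sports", "law", "sports"]
decision_tree = ["law", "law", "sports", "sports"]
svm           = ["sports", "sports", "law", "sports"]

ensemble = [majority_vote(votes) for votes in zip(naive_bayes, decision_tree, svm)]
print(ensemble)
```

Each classifier here is wrong on a different document, so the vote can still come out right on every one.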

             Unsupervised machine learning
               • The situation:
                     you don't know the categories of
                     interest, or want to discover new ones
               • The solution:
                     have a robot explore and find possible
                     categorizations for you, and use them to
                     categorize documents
               • Also known as "clustering"

                               No free lunch

               • No need for manual coding beforehand
               • But as much or more manual labor
                     is needed to evaluate suggested
                     categorizations afterwards
               • The value is a novel categorization,
                     not time or labor saved

                                        Grimmer & Stewart 2012, "Text as Data"
                                                            http://goo.gl/tFPFs

                                   Two kinds of
                               unsupervised learning

               • Single membership clustering:
                     each document is assigned to one category
               • Mixed membership clustering:
                     a document may be assigned to multiple
                     categories, each with a different proportion



                Single membership clustering

               1. Define a quantitative measure of similarity
                  between documents.
               2. Define a quantitative measure of how
                  "good" a cluster is.
               3. Define a process for optimizing the overall
                  goodness of the clusters.
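
The three definitions can be made concrete with a small sketch. The choices below are assumptions for illustration, not the only options: cosine similarity over word counts, mean similarity to the cluster centroid as the goodness measure, and a k-means-style loop as the optimization process. The documents are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Step 1: similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum the word counts of a cluster's documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def goodness(vectors, assignment, centroids):
    """Step 2: mean similarity of each document to its own cluster's centroid."""
    return sum(cosine(v, centroids[c]) for v, c in zip(vectors, assignment)) / len(vectors)

def cluster(vectors, k, max_iter=20):
    """Step 3: alternate between reassigning documents and updating centroids."""
    centroids = [Counter(v) for v in vectors[:k]]  # naive start: first k documents
    assignment = None
    for _ in range(max_iter):
        new = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new == assignment:
            break  # no document changed cluster, so stop
        assignment = new
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

docs = ["court statute judge", "team game coach", "judge court ruling", "coach team win"]
vectors = [Counter(d.split()) for d in docs]
assignment, centroids = cluster(vectors, k=2)
print(assignment, round(goodness(vectors, assignment, centroids), 3))
```

Each document ends up in exactly one cluster, which is what makes this "single membership". The result depends on the starting centroids and on `k`, both of which are arbitrary here.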


                               http://shabal.in/visuals.html

                Mixed membership clustering

               • Topic modeling is a popular example
               • Each document is modeled as a mixture of
                     categories or topics
               • A document is a probability distribution
                     over topics
               • A topic is a probability distribution
                     over words
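
Taken together, the last two bullets describe a mixture. As a sketch in notation that does not appear on the slides (d is a document, k is one of K topics, w is a word), the probability of seeing word w in document d is the topic-weighted average of the per-topic word probabilities:

```latex
P(w \mid d) = \sum_{k=1}^{K} P(w \mid k)\, P(k \mid d)
```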

                           Probability distribution




                               "Generating" text

               1. Roll our "topic dice" to choose a topic.
               2. Get the "word dice" corresponding to the
                  chosen topic.
               3. Roll the "word dice" to choose a word.
               4. Repeat until we've chosen all the words for
                  our text.
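
The four steps above can be sketched in a few lines of Python. This is only an illustration: the topic names, words, and probabilities below are made up, and a real topic model works in the other direction, inferring the dice from the texts.

```python
import random

# Hypothetical "topic dice": one document's distribution over topics.
TOPIC_DICE = {"war": 0.7, "commerce": 0.3}

# Hypothetical "word dice": each topic's distribution over words.
WORD_DICE = {
    "war": {"army": 0.5, "battle": 0.3, "general": 0.2},
    "commerce": {"cotton": 0.4, "price": 0.4, "sale": 0.2},
}


def roll(dice, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dice)
    weights = [dice[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_text(n_words, rng):
    """Generate n_words by repeating steps 1-3 once per word (step 4)."""
    words = []
    for _ in range(n_words):
        topic = roll(TOPIC_DICE, rng)       # 1. roll the topic dice
        word_dice = WORD_DICE[topic]        # 2. get that topic's word dice
        words.append(roll(word_dice, rng))  # 3. roll the word dice
    return words


if __name__ == "__main__":
    print(" ".join(generate_text(10, random.Random(0))))
```

Passing in a seeded random.Random makes a run repeatable; the output is a bag of words, not a readable sentence.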


@rybesh #duketext                                                           60
Duke Libraries / Text > Data                         September 20, 2012




                               Topic modeling demo




@rybesh #duketext                                                   61
Duke Libraries / Text > Data                           September 20, 2012




                               http://dsl.richmond.edu/dispatch/


@rybesh #duketext                                                     62
Duke Libraries / Text > Data                                                      September 20, 2012



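                 [Figure: text-analysis methods arranged on two axes]

    Vertical axis:   simple statistics / computation (bottom) to
                     complex statistics / computation (top)
    Horizontal axis: weaker domain assumptions (left) to
                     stronger domain assumptions (right)

                • Top:    Topic models
                • Middle: Supervised methods
                • Bottom: Word counting, Dictionary methods
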
@rybesh #duketext                         O'Connor, Bamman & Smith 2011 http://goo.gl/PxruI      63
Duke Libraries / Text > Data                                         September 20, 2012




                               Validating results
                      Keeping the machines from leading you astray




@rybesh #duketext                                                                   64
Duke Libraries / Text > Data                           September 20, 2012




                           Validating word counts


               • Text data may have errors (e.g. from OCR)
               • Metadata may have errors
               • Texts may appear multiple times
               • Collections are biased samples
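
One of these checks, finding texts that appear more than once, is easy to automate. A minimal sketch, assuming the texts are held in a dict keyed by a document id (the function names are hypothetical). It only catches copies that are identical after lowercasing and collapsing whitespace, so copies that differ by OCR errors will slip through.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash a text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(texts):
    """Return groups of document ids whose normalized texts are identical."""
    groups = defaultdict(list)
    for doc_id, text in texts.items():
        groups[fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Each returned group is a set of ids to inspect by hand before deciding which copy to keep.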

@rybesh #duketext                                                     65
Duke Libraries / Text > Data                                         September 20, 2012




                               http://languagelog.ldc.upenn.edu/nll/?p=1701

@rybesh #duketext                                                                   66
Duke Libraries / Text > Data                                  September 20, 2012




               Validating dictionary methods


               • Must verify that dictionary categorizations
                     match human judgments
               • But humans can't reliably "score"
                     documents on "positivity" or "litigiousness"
               • Better to convert scores to simple binaries
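
A sketch of what converting scores to binaries might look like: threshold the dictionary score into a yes/no label, then measure how often it matches a yes/no human judgment. The threshold of 0.0 and the function names are assumptions for illustration, not part of any particular dictionary tool.

```python
def to_binary(scores, threshold=0.0):
    """Convert continuous dictionary scores to True/False labels."""
    return [s > threshold for s in scores]


def agreement(machine, human):
    """Fraction of documents on which the two label lists agree."""
    if len(machine) != len(human):
        raise ValueError("label lists must be the same length")
    matches = sum(m == h for m, h in zip(machine, human))
    return matches / len(human)


if __name__ == "__main__":
    # Hypothetical "positivity" scores and human yes/no judgments.
    scores = [0.8, -0.2, 0.1, -0.5]
    human = [True, False, False, False]
    print(agreement(to_binary(scores), human))
```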

@rybesh #duketext                                                            67
Duke Libraries / Text > Data                             September 20, 2012




              Validating supervised methods

               • Ideally: take two random non-overlapping
                     samples and manually code them.
               • Use the first sample to train your
                     supervised learning algorithm.
               • Use the second sample to evaluate its
                     performance.
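
Drawing the two samples can be sketched as follows, assuming documents are identified by ids in a list; the function name and the fixed seed are illustrative choices. Shuffling once and slicing guarantees the samples cannot overlap.

```python
import random


def split_samples(doc_ids, sample_size, seed=0):
    """Draw two random, non-overlapping samples of document ids."""
    if 2 * sample_size > len(doc_ids):
        raise ValueError("not enough documents for two samples")
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:sample_size]                 # code these, then train
    test = shuffled[sample_size:2 * sample_size]   # code these, then evaluate
    return train, test
```

Recording the seed alongside the coded samples lets someone else reproduce exactly which documents were used.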


@rybesh #duketext                                                       68
Duke Libraries / Text > Data                                             September 20, 2012




    Accuracy: 197 / 262 = 75%
    (diagonal cells: 57 + 30 + 110 = 197 documents coded correctly)


                                         figurative   mixed        literal

                        figurative           57        32            2


                               mixed        21        30            6


                               literal      0          4           110


                                                             262 documents

@rybesh #duketext                                                              69
Duke Libraries / Text > Data                                             September 20, 2012




       Precision: 57 / 78 = 73%
       figurative category (78 = figurative column total: 57 + 21 + 0)

                                         figurative   mixed        literal

                        figurative           57        32            2


                               mixed        21        30            6


                               literal      0          4           110


                                                             262 documents

@rybesh #duketext                                                              70
Duke Libraries / Text > Data                                             September 20, 2012




       Recall: 57 / 91 = 63%
       figurative category (91 = figurative row total: 57 + 32 + 2)

                                         figurative   mixed        literal

                        figurative           57        32            2


                               mixed        21        30            6


                               literal      0          4           110
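
The accuracy, precision, and recall figures on slides 69 to 71 can be recomputed from this table. A minimal sketch, assuming rows are the manual codes and columns are the labels the classifier assigned; the slides do not label the axes, so that reading is an inference from the 57 / 78 and 57 / 91 figures.

```python
LABELS = ["figurative", "mixed", "literal"]

# Assumed layout: rows are manual codes, columns are classifier labels.
MATRIX = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]


def accuracy(matrix):
    """Share of all documents on the diagonal (labels agree)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Of the documents the classifier put in category k, the share that
    were manually coded k (diagonal cell over column total)."""
    predicted = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted


def recall(matrix, k):
    """Of the documents manually coded k, the share the classifier also
    put in k (diagonal cell over row total)."""
    actual = sum(matrix[k])
    return matrix[k][k] / actual


if __name__ == "__main__":
    k = LABELS.index("figurative")
    print(f"accuracy  {accuracy(MATRIX):.0%}")
    print(f"precision {precision(MATRIX, k):.0%}")
    print(f"recall    {recall(MATRIX, k):.0%}")
```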


                                                             262 documents

@rybesh #duketext                                                              71
Duke Libraries / Text > Data                                  September 20, 2012




          Validating unsupervised methods

               • There are statistical measures of how well
                     a particular clustering "fits" the data
               • These are not appropriate for
                     evaluating unsupervised clustering of texts
               • The "data" is butchered text; we don't
                     want to fit it well


@rybesh #duketext                                                            72
Duke Libraries / Text > Data                           September 20, 2012




          Validating unsupervised methods


               • Does the categorization make sense?
               • Are the categories distinct?
               • Are they internally consistent?
               • Do they provide insight?

@rybesh #duketext                                                     73
Duke Libraries / Text > Data                               September 20, 2012




                     Validating topic coherence

                 { dog, cat, horse, apple, pig, cow }


     { car, teacher, platypus, agile, blue, Zaire }
                                 ?

                                              Chang et al. 2009
                                             http://goo.gl/FCizP
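
The two word sets come from the word-intrusion task in Chang et al. 2009: annotators see a topic's top words plus one "intruder" from another topic and try to spot it. In the first set "apple" stands out; in the second nothing does, which signals an incoherent topic. A sketch of the per-topic score, the fraction of annotators who pick the true intruder (the function name and the annotator responses are hypothetical):

```python
def model_precision(choices, intruder):
    """Fraction of annotators who picked the true intruder word."""
    if not choices:
        raise ValueError("need at least one annotator choice")
    return sum(c == intruder for c in choices) / len(choices)


if __name__ == "__main__":
    # Hypothetical responses from four annotators for the first word set.
    print(model_precision(["apple", "apple", "horse", "apple"], "apple"))
```

A coherent topic should score near 1; a score near chance (one in six here) suggests annotators were guessing.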

@rybesh #duketext                                                         74
Duke Libraries / Text > Data                  September 20, 2012




                    Validating topic assignment




@rybesh #duketext                                            75
Duke Libraries / Text > Data                            September 20, 2012




          Validating unsupervised methods

           • Compared to other (manual) categorizations,
                 how well does this one approximate judgments
                 of document relatedness?
           • Do the categories correlate with external facts?
           • Turn the categories into a coding scheme and
                 apply supervised methods


@rybesh #duketext                                                      76
Duke Libraries / Text > Data                                            September 20, 2012




                                  Managing data
                               Helping others stand on your shoulders




@rybesh #duketext                                                                      77
Duke Libraries / Text > Data                              September 20, 2012




                               Three kinds of data

               1. The texts you're analyzing and derivations
                  thereof
               2. The software code you're using to process
                  and analyze your texts
               3. Documentation of your process


@rybesh #duketext                                                        78
Duke Libraries / Text > Data                                September 20, 2012




                               Textual data


               • You want to keep all intermediate versions
                     of the texts you're processing
               • A version control system is ideal for this
               • Version control hosting platforms such as
                     GitHub are ideal for sharing your data too


@rybesh #duketext                                                          79
Duke Libraries / Text > Data                            September 20, 2012




                               Software data


               • Ideally, use open-source software
               • Keep past versions of whatever software
                     you use
               • Use version control for your own scripts
                     and software


@rybesh #duketext                                                      80
Duke Libraries / Text > Data                              September 20, 2012




                               Documentary data


               • This is the hardest data to manage
               • Consider keeping a (public or private)
                     "lab notebook" blog
               • Keep anything else you write related to the
                     project, formal or informal


@rybesh #duketext                                                        81
Duke Libraries / Text > Data                             September 20, 2012




                         Long-term preservation


            • Data under version control can be exported,
                   including all versions
            • Create static snapshots of websites, blogs, etc.
            • Place everything in a long-term digital
                   repository such as DukeSpace


@rybesh #duketext                                                       82
Duke Libraries / Text > Data                                September 20, 2012




                                Take-aways

             • Text analysis can be a powerful tool.
             • It's a systematic method of transforming
                   texts to produce new texts for interpretation.
             • It only augments human judgment and
                   interpretation; it can't replace them.
             • Be excited by the possibilities
                   but skeptical of the hype.

@rybesh #duketext                                                          83
Duke Libraries / Text > Data                        September 20, 2012




                                  Thanks!
                                http://aesh.in/RC
                               ryanshaw@unc.edu




@rybesh #duketext                                                  84

 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Featured

Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 

Featured (20)

Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 

Text-mining as a Research Tool in the Humanities and Social Sciences

  • 1. Duke Libraries / Text > Data September 20, 2012 Text-mining as a Research Tool in the Humanities and Social Sciences Ryan Shaw ryanshaw@unc.edu http://aesh.in/RC @rybesh #duketext 1
  • 11. Roberto Busa
  • 13. Automated text analysis
      Automated text analysis is a tool for discovery and measurement in textual data of prevalent attitudes, concepts, or events.
      O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
  • 14. Automated text analysis
      Automated text analysis is a tool for discovery and measurement in textual data of patterns of language use interpretable as prevalent attitudes, concepts, or events.
      O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
  • 18. Language modeling
      • Methods for automated text analysis are based on mathematical models of language
      • Mathematical models distinguish elements and make explicit the relations among them
      • They do not explain, but they can be interpreted
      Black 1962, "Models and Archetypes" http://goo.gl/zKtrx
  • 22. Language modeling
      • All mathematical models of language are necessarily wrong
      • Nevertheless they may be useful
      • They must be evaluated on their ability to help scholars make inferences, achieve insights, and generate new interpretations
      Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
  • 28. Plan of attack
      • Acquiring text
      • Representing text
      • Analyzing text
      • Validating results
      • Managing data
  • 29. Acquiring text: Collecting your data
  • 34. Sources
      • Existing digital corpora
      • Other digital sources (e.g. the Web, Twitter)
      • Undigitized text
  • 39. Existing digital corpora
      • Ideally, texts will be available as XML
      • Quality of text and metadata is high
      • But collections tend to be small
      • Licensing agreements may prohibit text analysis
  • 40. http://www.hathitrust.org/htrc
      • 10.5 million total volumes
      • 5.5 million book titles
      • 270,000 serial titles
      • 3.2 million public domain
  • 46. Other digital sources
      • Some kinds of texts (e.g. tweets) can be obtained through an API
      • Websites without APIs can be "scraped"
      • Generally requires custom programming
      • Website restrictions may limit how much or how quickly texts can be collected
      • Metadata will be limited or absent
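The kind of custom programming that scraping involves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TextExtractor`, `extract_text`, and `scrape` are hypothetical, the `fetch` argument stands for whatever function you use to download a page, and the one-second default pause is an arbitrary placeholder for whatever limit a site actually imposes.

```python
import time
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def scrape(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests to respect site limits.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the page's HTML as a string.
    """
    texts = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay_seconds)
        texts[url] = extract_text(fetch(url))
    return texts
```

Note that the result holds only the page text keyed by URL; any metadata (author, date, and so on) would have to be extracted separately, if the page exposes it at all.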
  • 51. Undigitized text
      • Undigitized text must be scanned and subjected to Optical Character Recognition
      • Time and labor intensive
      • OCR will introduce errors in your texts
      • You need to produce your own metadata
  • 56. Preparing texts
      • OCR errors
      • Words broken across lines
      • Running headers and footers
      • Breaking into paragraphs, sentences, etc.
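Three of the cleanup tasks listed above can be sketched with Python's standard library. The function names and the 50% threshold are assumptions made for illustration, and each heuristic is deliberately naive: real OCR output usually needs more careful rules.

```python
import re
from collections import Counter


def drop_running_lines(pages, min_fraction=0.5):
    """Remove lines that recur on more than `min_fraction` of the pages.

    `pages` is a list of page texts. Lines repeated across most pages
    are assumed to be running headers or footers.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if counts[line.strip()] <= threshold]
        cleaned.append("\n".join(kept))
    return cleaned


def join_broken_words(text):
    """Rejoin words hyphenated across a line break ("light-\\nhouse").

    Caveat: a genuine compound that happens to break at its hyphen
    will lose the hyphen.
    """
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


def split_sentences(text):
    """Split on sentence-final punctuation followed by a capital letter.

    Caveat: abbreviations such as "Mr. Connor" will be split incorrectly.
    """
    text = " ".join(text.split())
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
```

A typical order would be to drop running lines page by page first, then rejoin broken words, then split the result into sentences.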
  • 60. Preparing texts
      • The bulk of your time will be spent acquiring and preparing your texts
      • Worth your time to learn a scripting language (such as Python)
      • Command-line text-processing tools on Mac OS and Unix also very useful
  • 61. Representing text: Turning words into numbers
  • 62. Slowly welling from the point of her gold nib, pale blue ink dissolved the full stop; for there her pen stuck; her eyes fixed, and tears slowly filled them. The entire bay quivered; the lighthouse wobbled; and she had the illusion that the mast of Mr. Connor's little yacht was bending like a wax candle in the sun. She winked quickly. Accidents were awful things. She winked again. The mast was straight; the waves were regular; the lighthouse was upright; but the blot had spread.
  • 63.
      11 the          1 wax        1 quivered
       3 was          1 waves      1 quickly
       3 she          1 upright    1 point
       3 her          1 things     1 pen
       2 winked       1 there      1 pale
       2 were         1 them       1 nib
       2 slowly       1 that       1 mr
       2 of           1 tears      1 little
       2 mast         1 sun        1 like
       2 lighthouse   1 stuck      1 ink
       2 had          1 straight   1 in
       2 and          1 stop       1 illusion
       1 yacht        1 spread     1 gold
       1 wobbled      1 s          1 full
       1 welling      1 regular    1 from
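Counts like these take only a few lines of Python to produce. The sketch below assumes a simple tokenization (lowercase the text and keep runs of the letters a to z), which is consistent with "Mr." appearing as "mr" and "Connor's" yielding a stray "s" in the table; the function name `count_words` is hypothetical.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, split it into runs of letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# The last sentence of the passage above.
sentence = ("The mast was straight; the waves were regular; "
            "the lighthouse was upright; but the blot had spread.")
counts = count_words(sentence)
print(counts.most_common(2))  # [('the', 4), ('was', 2)]
```

`Counter.most_common()` with no argument returns every word ordered from most to least frequent, which is the ordering used in the table.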
  • 64.
      11 the          1 wax        1 quiver
       3 wa           1 wave       1 quickli
       3 she          1 upright    1 point
       3 her          1 thing      1 pen
       2 wink         1 there      1 pale
       2 were         1 them       1 nib
       2 slowli       1 that       1 mr
       2 of           1 tear       1 littl
       2 mast         1 sun        1 like
       2 lighthous    1 stuck      1 ink
       2 had          1 straight   1 in
       2 and          1 stop       1 illus
       1 yacht        1 spread     1 gold
       1 wobbl        1 s          1 full
       1 well         1 regular    1 from
  • 66. [Table of term counts per document; the column each count belongs to is not recoverable]
      Columns: doc 1, doc 2, doc 3, doc 4, doc 5, doc 6
      accid      1
      actual     1
      again      1 1
      alreadi    1
      antenna    1
      archer     1
      avoid      2 1
      awai       1
      aw         1
      bag        1
      bandanna   1
      barfoot    2
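A table of this shape, with one row per term and one column per document, can be built from per-document word counts. The following is a minimal sketch: the function name `term_document_matrix` and the toy documents are made up for illustration, and no stemming is applied.

```python
import re
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) for a list of document strings.

    `matrix[term]` is a list with one count per document, in input order.
    """
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = {term: [c[term] for c in counts] for term in vocabulary}
    return vocabulary, matrix


# Toy documents, not taken from the slides.
vocabulary, matrix = term_document_matrix(["avoid again", "again", "avoid avoid"])
for term in vocabulary:
    print(term, matrix[term])
```

Most cells in a real matrix are zero, because any one document uses only a small part of the whole vocabulary.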
  • 70. Document similarity
      [Figure: doc 1 and doc 6 plotted on axes "avoid" and "again" (ticks at 1 and 2), with "similarity" marked between them]
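Once documents are represented as vectors of word counts, their similarity can be computed numerically. The slide does not say which measure is intended; the sketch below assumes cosine similarity, one common choice, and uses made-up count vectors rather than values read from the figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical counts of ("avoid", "again") in two documents.
doc_a = [2, 1]
doc_b = [1, 1]
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same proportions of words score as identical.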
  • 71. Analyzing text: Counting, comparing, categorizing and pattern-finding
  • 79. Six methods of text analysis
      • Reading
      • Counting words
      • Human coding (manual content analysis)
      • Dictionary methods
      • Supervised machine learning
      • Unsupervised machine learning
      Quinn et al. 2010 http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x
  • 80. Counting words
      http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html
  • 82. Michel et al. 2010 http://dx.doi.org/10.1126/science.1199644
  • 88. Counting words
      • Easily computed
      • Results are replicable
      • Comparisons require metadata, e.g. publication year, language, subject category, location
      • Word use is ambiguous
      • Spelling may vary
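The dependence of comparisons on metadata can be made concrete with a short sketch. It assumes each text comes paired with a publication year; the function name and the toy corpus are hypothetical. Counts are divided by the total number of words per year so that years with more text do not dominate.

```python
import re
from collections import defaultdict


def relative_frequency_by_year(documents, word):
    """Share of all tokens in each year that equal `word`.

    `documents` is an iterable of (year, text) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Toy corpus, for illustration only.
corpus = [(1900, "war and peace"), (1900, "peace"), (1950, "war war")]
print(relative_frequency_by_year(corpus, "war"))  # {1900: 0.25, 1950: 1.0}
```

The exact string match also shows the last two caveats in the list: variant spellings are counted as different words, and different senses of one word are counted together.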
  • 91. Concordance tools
  • 95. Dictionary methods
      • A dictionary is simply a list of words
      • Lists are compiled for specific categories of interest: negative words, law-related words, names of places, names of chemicals, etc.
      • May be custom-built or reused
  • 96. Lexicoder Sentiment Dictionary
      A LIE      0    WOUNDED    0    ABILITY*      1    WOOS            1
      ABANDON*   0    WOUNDS     0    ABOUND*       1    WORKABLE*       1
      ABAS*      0    WRATH*     0    ABSOLV*       1    WORKMANSHIP*    1
      ABATTOIR*  0    WRECK*     0    ABSORBENT*    1    WORSHIP*        1
      ABDICAT*   0    WRESTL*    0    ABSORPTION*   1    WORTH           1
      ABERRA*    0    WRETCH*    0    ABUNDANC*     1    WORTH WHILE*    1
      ABHOR*     0    WRITHE*    0    ABUNDANT*     1    WORTHI*         1
      ABJECT*    0    WRONG*     0    ACCED*        1    WORTHWHILE*     1
      ABNORMAL*  0    XENOPHOB*  0    ACCENTUAT*    1    WORTHY*         1
      ABOLISH*   0    YAWN*      0    ACCEPT*       1    YOUNG AT HEART  1
      ABOMINAB*  0    YEARN*     0    ACCESSIB*     1    ZEAL            1
      ABOMINAT*  0    YUCK*      0    ACCLAIM*      1    ZEALOUS*        1
      ABRASIV*   0    ZEALOT*    0    ACCLAMATION*  1    ZEST*           1
  • 98. Litigious Words
      ACQUITTANCE    DOCKET         LEGALIZATIONS  QUITCLAIM
      ADJOURNING     ESCHEATED      LEGALLY        REBUTS
      APPELLANTS     EXCEEDENCES    LITIGATORS     REQUESTER
      APPOINTOR      EXCULPATED     MISTRIALS      RESCINDS
      ARBITRATE      FOREBEAR       NOTARIZE       STATUTE
      ASSERTABLE     INASMUCH       NOTARIZED      SUBPARAGRAPHS
      CHATTEL        INDEMNITY      OBLIGOR        SUBPOENAS
      CODIFICATIONS  INJUNCTION     PERSONAM       SUBTRUSTS
      CONVICTED      INTERLOCUTORY  PLEADS         TENANTABILITY
      COUNTERSUIT    INTERPLEADER   POSTJUDGMENT   TESTAMENTARY
      DEFEASANCE     INTERROGATE    PRETRIAL       UNENCUMBERED
      DELEGATEE      IRREVOCABLY    PRIMA          UNREMEDIATED
      DEPOSED        LEGALIZATION   PROSECUTIONS   WHEREOF
  • 103. Simple dictionary algorithm
      • For each word in document:
          • +1 if the word is in the positive list
          • –1 if the word is in the negative list
      • Divide the total by the number of words
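The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Lexicoder implementation: the two word lists are made up for the example, the tokenizer is a simple regular expression, and matching is exact (real dictionaries such as Lexicoder also use wildcard stems like ABANDON*, which this sketch ignores).

```python
import re

# Hypothetical mini word lists, for illustration only; a real study would
# load a published dictionary instead.
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"poor", "failed", "trouble"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (positive hits - negative hits) / total number of words."""
    words = tokenize(text)
    if not words:
        return 0.0
    total = 0
    for word in words:
        if word in positive:
            total += 1      # +1 if the word is in the positive list
        elif word in negative:
            total -= 1      # -1 if the word is in the negative list
    return total / len(words)


print(dictionary_score("A good start, but a poor finish and more trouble ahead."))
```

The example sentence has 11 words, one positive hit and two negative hits, so its score is –1 / 11. Dividing by document length keeps long and short documents comparable.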
  • 109. 26 uses of positive words – 51 uses of negative words = –25
         –25 / 779 total words = –0.032
  • 110. AGAINST         LIMITED
         AGGRESSIVENESS  LIMITING
         ATTACK          NEGATE
         ATTACKING       OFFENSE
         CHALLENGE       OFFENSIVE
         CONTRAST        OFFENSIVELY
         DEFENSIVE       OPPOSING
         DEFICIENCIES    PLAGUED
         DEVIL           POOR
         DEVILS          PROBLEM
         DISMAL          SHORTCOMINGS
         EXPLOIT         SLUGGISH
         FAILED          THORNTON
         FOUL            THREATS
         FOULING         TOO
         FOULS           TROUBLE
         FUTILITY        TROUBLES
         INABILITY       UNABLE

         ADEQUATELY      IMPROVEMENT
         ADVANTAGE       KEEPING
         ASSISTS         LIKE
         EFFICIENT       PATRIOT
         EFFICIENTLY     PERFECT
         EFFORT          RESPONSIBLE
         FREE            SIGNIFICANT
         FRESHMAN        STRONGER
         GOOD            SUCCESS
         GREAT           WELL
  • 118. Supervised machine learning
      • The situation: you know the categories of interest
      • The problem: human coding of documents doesn't scale
      • The solution: teach a robot to do it
  • 120. Welcome your robot overlords
  • 121. Augmenting human capacity
  • 132. Supervised machine learning
      1. Create a training set.
      2. Use the training set to "teach" a supervised learning algorithm how to map document features (e.g. words) to categories.
      3. Test your classifying machine to see if it learned correctly.
      4. Use it to classify the rest of your documents.
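To make the four steps concrete, here is a small sketch using a hand-rolled multinomial Naïve Bayes classifier with add-one smoothing. The choice of algorithm, the two categories, and the toy documents are all assumptions made for illustration; the talk does not prescribe a particular classifier or dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Step 2: count documents per category and words per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing so unseen word/category pairs are not zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Step 1: a (tiny, made-up) hand-coded training set.
training_set = [
    ("the senate passed the budget bill", "politics"),
    ("the governor vetoed the tax bill", "politics"),
    ("the team won the championship game", "sports"),
    ("the coach praised the young team", "sports"),
]
# Step 3: separate hand-coded documents held back for testing.
test_set = [
    ("the senate debated the tax budget", "politics"),
    ("the coach won the game", "sports"),
]

model = train_naive_bayes(training_set)
correct = sum(classify(text, model) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on held-out documents: {accuracy:.0%}")

# Step 4: classify a document nobody has coded.
print(classify("the governor signed the budget", model))
```

With only four training documents the numbers mean nothing; the point is the shape of the workflow, in which the test documents are never used for training.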
  • 136. Creating a training set
      • Create a coding scheme that humans can use reliably and without ambiguity.
      • Select (ideally randomly) a subset of your documents, and code them by hand.
      • You need "enough" documents: more categories, more documents.
  • 140. Supervised learning algorithms
      • Many kinds: Naïve Bayes, decision trees / random forests, support vector machines, neural networks, etc.
      • No "best" one: performance is domain- and dataset-specific
      • "Ensembles" of different algorithms can often outperform single algorithms
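One simple way to build an ensemble is majority voting: each classifier labels the document and the most common label wins. The sketch below assumes that scheme (there are others, such as weighted voting or stacking), and its three "classifiers" are deliberately crude, hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Ask every classifier for a label and return the most common answer."""
    votes = [classify(document) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical rule-of-thumb classifiers standing in for real models.
def by_keyword(doc):
    return "sports" if "game" in doc else "politics"


def by_length(doc):
    return "sports" if len(doc.split()) < 6 else "politics"


def by_second_keyword(doc):
    return "politics" if "bill" in doc else "sports"


ensemble = [by_keyword, by_length, by_second_keyword]
document = "the senate passed the budget bill after the game"
print(majority_vote(ensemble, document))
```

Here the keyword rule is misled by the word "game", but the other two classifiers outvote it. An odd number of voters avoids ties when there are only two categories.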
  • 145. Unsupervised machine learning
      • The situation: you don't know the categories of interest, or want to discover new ones
      • The solution: have a robot explore and find possible categorizations for you, and use them to categorize documents
      • Also known as "clustering"
  • 149. No free lunch
      • No need for manual coding beforehand
      • But as much or more manual labor is needed to evaluate suggested categorizations afterwards
      • The value is a novel categorization, not time or labor saved
      Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
  • 152. Two kinds of unsupervised learning
      • Single membership clustering: each document is assigned to one category
      • Mixed membership clustering: a document may be assigned to multiple categories, each with a different proportion
  • 156. Single membership clustering
      1. Define a quantitative measure of similarity between documents.
      2. Define a quantitative measure of how "good" a cluster is.
      3. Define a process for optimizing the overall goodness of the clusters.
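The three definitions can be filled in many ways. The sketch below picks one set of choices, all of them assumptions rather than anything the talk specifies: cosine similarity over word counts as the similarity measure, total similarity of documents to their cluster centroid as the goodness measure, and a k-means-style loop as the optimization process. The seed documents are chosen by hand to keep the example deterministic.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Step 1: similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum the member vectors; only direction matters for cosine similarity."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return total


def goodness(vectors, assignment, centroids):
    """Step 2: total similarity of each document to its cluster centroid."""
    return sum(cosine(v, centroids[c]) for v, c in zip(vectors, assignment))


def cluster(texts, seeds, max_rounds=20):
    """Step 3: alternate assignment and centroid updates until stable."""
    vectors = [vectorize(t) for t in texts]
    centroids = [vectors[i] for i in seeds]
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering is stable
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, goodness(vectors, assignment, centroids)


texts = [
    "senate budget tax bill",
    "team game coach win",
    "senate vote tax bill",
    "coach team game season",
]
assignment, score = cluster(texts, seeds=[0, 1])
print(assignment, round(score, 3))
```

Each document ends up in exactly one cluster, which is what makes this "single membership". Different seeds or a different similarity measure can give a different clustering.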
  • 160. http://shabal.in/visuals.html
  • 165. Mixed membership clustering
      • Topic modeling is a popular example
      • Each document is modeled as a mixture of categories or topics
      • A document is a probability distribution over topics
      • A topic is a probability distribution over words
  • 166. Probability distribution
  • 171. "Generating" text
      1. Roll our "topic dice" to choose a topic.
      2. Get the "word dice" corresponding to the chosen topic.
      3. Roll the "word dice" to choose a word.
      4. Repeat until we've chosen all the words for our text.
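The dice-rolling story translates almost directly into code. In this sketch the topics, words, and probabilities are invented for illustration; in a real topic model they are what the algorithm estimates from the texts.

```python
import random

# Hypothetical "dice": every name and number here is illustrative.
# One document's distribution over topics:
topic_dice = {"sports": 0.7, "politics": 0.3}
# Each topic's distribution over words:
word_dice = {
    "sports": {"game": 0.5, "team": 0.3, "coach": 0.2},
    "politics": {"senate": 0.4, "bill": 0.4, "vote": 0.2},
}


def roll(dice, rng):
    """Draw one outcome according to its probability."""
    outcomes = list(dice)
    weights = [dice[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_text(length, rng):
    words = []
    for _ in range(length):                # 4. repeat for every word
        topic = roll(topic_dice, rng)      # 1. roll the topic dice
        dice = word_dice[topic]            # 2. get that topic's word dice
        words.append(roll(dice, rng))      # 3. roll the word dice
    return words


rng = random.Random(2012)  # fixed seed so the run is repeatable
print(" ".join(generate_text(10, rng)))
```

Nobody writes this way, of course. The generative story is a modeling assumption: fitting a topic model means working backwards from observed words to the dice that would most plausibly have produced them.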
  • 172. Topic modeling demo
  • 173. http://dsl.richmond.edu/dispatch/
  • 174. [Figure: methods placed along two axes, from simple to complex statistics / computation and from weaker to stronger domain assumptions. Methods shown: word counting, dictionary methods, supervised methods, topic models.] O'Connor, Bamman & Smith 2011 http://goo.gl/PxruI
  • 175. Validating results: Keeping the machines from leading you astray
  • 180. Validating word counts
      • Text data may have errors (e.g. from OCR)
      • Metadata may have errors
      • Texts may appear multiple times
      • Collections are biased samples
  • 184. http://languagelog.ldc.upenn.edu/nll/?p=1701
  • 188. Validating dictionary methods
      • Must verify that dictionary categorizations match human judgments
      • But humans can't reliably "score" documents on "positivity" or "litigiousness"
      • Better to convert scores to simple binaries
  • 192. Validating supervised methods
      • Ideally: take two random non-overlapping samples and manually code them.
      • Use the first sample to train your supervised learning algorithm.
      • Use the second sample to evaluate its performance.
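Drawing the two samples is easy to get subtly wrong (for example by sampling twice and accidentally picking the same document both times). A minimal sketch, assuming the documents fit in a list and using made-up sample sizes: draw one random sample of the combined size, then split it.

```python
import random


def two_disjoint_samples(documents, train_size, test_size, seed=0):
    """Draw two random, non-overlapping samples from a list of documents."""
    if train_size + test_size > len(documents):
        raise ValueError("not enough documents for both samples")
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    drawn = rng.sample(documents, train_size + test_size)
    return drawn[:train_size], drawn[train_size:]


document_ids = list(range(1000))  # stand-ins for real documents
to_train, to_evaluate = two_disjoint_samples(document_ids, 200, 100)
print(len(to_train), len(to_evaluate))
```

Because `random.sample` draws without replacement, the two halves cannot share a document. Recording the seed alongside the hand-coded samples lets someone else reconstruct exactly which documents were used for training and which for evaluation.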
  • 195. Accuracy: 197 / 262 = 75%

                      figurative   mixed   literal
        figurative            57      32         2
        mixed                 21      30         6
        literal                0       4       110

        262 documents
  • 196. Precision (figurative category): 57 / 78 = 73%
  • 197. Recall (figurative category): 57 / 91 = 63%
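The three figures can be recomputed from the table. The slides do not label the axes, so the sketch below assumes rows are the human-coded categories and columns are the classifier's predictions; that reading is the one consistent with the slide arithmetic, since 78 is the sum of the "figurative" column and 91 the sum of the "figurative" row.

```python
labels = ["figurative", "mixed", "literal"]
# Assumption: rows = human-coded category, columns = classifier's prediction.
matrix = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]


def accuracy(m):
    """Proportion of all documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / total


def precision(m, k):
    """Of the documents predicted to be category k, how many really are?"""
    predicted_k = sum(row[k] for row in m)
    return m[k][k] / predicted_k


def recall(m, k):
    """Of the documents that really are category k, how many were found?"""
    actual_k = sum(m[k])
    return m[k][k] / actual_k


k = labels.index("figurative")
print(f"accuracy  {accuracy(matrix):.0%}")
print(f"precision {precision(matrix, k):.0%}")
print(f"recall    {recall(matrix, k):.0%}")
```

Overall accuracy hides where the errors fall: most of them come from confusing "figurative" with "mixed", while "literal" is classified almost perfectly.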
  • 201. Validating unsupervised methods
      • There are statistical measures of how well a particular clustering "fits" the data
      • These are not appropriate for evaluating unsupervised clustering of texts
      • The "data" is butchered text; we don't want to fit it well
  • 206. Validating unsupervised methods
      • Does the categorization make sense?
      • Are the categories distinct?
      • Are they internally consistent?
      • Do they provide insight?
  • 210. Validating topic coherence
      { dog, cat, horse, apple, pig, cow }
      { car, teacher, platypus, agile, blue, Zaire } ?
      Chang et al. 2009 http://goo.gl/FCizP
  • 212. Validating topic assignment
  • 216. Validating unsupervised methods
      • Compared to other (manual) categorizations, how well does this one approximate judgments of document relatedness?
      • Do the categories correlate with external facts?
      • Turn the categories into a coding scheme and apply supervised methods
  • 217. Managing data: Helping others stand on your shoulders
  • 221. Three kinds of data
      1. The texts you're analyzing and derivations thereof
      2. The software code you're using to process and analyze your texts
      3. Documentation of your process
  • 225. Textual data
      • You want to keep all intermediate versions of the texts you're processing
      • A version control system is ideal for this
      • Version control hosting platforms such as GitHub are ideal for sharing your data too
  • 229. Software data
      • Ideally, use open-source software
      • Keep past versions of whatever software you use
      • Use version control for your own scripts and software
  • 233. Documentary data
      • This is the hardest data to manage
      • Consider keeping a (public or private) "lab notebook" blog
      • Anything else you write related to the project, formal or informal
  • 237. Long-term preservation
      • Data under version control can be exported, including all versions
      • Create static snapshots of websites, blogs, etc.
      • Place everything in a long-term digital repository such as DukeSpace
  • 242. Take-aways
      • Text analysis can be a powerful tool.
      • It's a systematic method of transforming texts to produce new texts for interpretation.
      • It only augments human judgment and interpretation; it can't replace them.
      • Be excited by the possibilities but skeptical of the hype.
  • 245. Thanks!
      http://aesh.in/RC
      ryanshaw@unc.edu
      @rybesh #duketext

Editor's Notes

  9. 1949: persuaded IBM to sponsor his project to produce a complete concordance of the works of St. Thomas Aquinas. 30 years. Not new; what's new is that it has become affordable, in both money and time.
  10. title of this workshop mentions "text mining", i prefer
  12. through a process of abstraction...
  68. We computed a “suppression index” for each person by dividing their frequency from 1933-1945 by the mean frequency in 1925-1933 and in 1955-1965.
  80. designed to capture the sentiment of political texts
  237. accuracy: proportion correctly classified