This document outlines a two-session course on using R for text analysis in the humanities. In the first session, participants learn the basics of R and RStudio and how to clean text data. They prepare a corpus of science fiction stories for analysis. The second session demonstrates counting word frequencies, plotting results as bar charts and word clouds, and discusses potentials and limitations of computational text analysis with R. Resources for learning R and text analysis packages are provided.
R in the Humanities: Text Analysis and Visualization
1. R in the Humanities: Text Analysis (2022)
Dr Leah Henrickson
Lecturer in Digital Media
School of Media and Communication
University of Leeds
L.R.Henrickson@leeds.ac.uk
twitter.com/leahhenrickson
2. Who am I?
• Lecturer in Digital Media
• Programme Leader, MA New Media
• Book historian
• Digital humanist
• Canadian 🍁
3. Publication in the next issue of Victorian Review: ‘Tangling and Untangling the Trollopes’, with Eleanor Dumbill
4. Session 1:
Gettin’ to Grips with R
5. Overview
This course is a gentle introduction to R for text analysis. Over two sessions, you will learn the basics of this powerful programming language and gain hands-on experience analysing long-form text in the RStudio development environment.
By the end of the course, you will be able to:
• Navigate the RStudio development environment
• Prepare long-form prose texts for computational analysis using R
• Conduct basic computational analyses of long-form prose texts
• Construct and explain visualisations of computed results
• Critically apply computational text analysis to complement other analytical methods
To complete this course you will need to install:
• R version 3.6 or higher (download at https://www.r-project.org)
• RStudio Desktop: Open Source Edition 1.2 or higher (download at https://www.rstudio.com/products/rstudio)
6. Session 1 Agenda
1. What are R and RStudio?
2. What can R help you do?
3. A quick note about Computational Literary Studies
4. Getting started with R
5. Cleaning text
7. What are R and RStudio?
R is:
• a programming language
• a software environment
• a really fancy calculator
• free/open source
Download: https://cran.r-project.org/mirrors.html
RStudio is:
• an integrated development environment (IDE)
• a great way to make your coding experiences easier, more colourful, and more fun!
Download: https://www.rstudio.com/products/rstudio/download
8. What can R help you do?
• Count words
• Find linguistic patterns within and across texts
• Compare texts
• Make pretty pictures
But it’s still up to you to explain results.
Also, is R always the most appropriate tool?
9. A quick note about Computational Literary Studies (CLS)
CLS has a long history (for example, Father Roberto Busa, ~1940s), but has been criticised for:
• Misinterpretation of statistical data (Da)
• Unchecked enthusiasm for technological ‘hype’ (Kirsch)
• Turning literature into data and neglecting reception of works (Marche)
Da, Nan Z. “The Computational Case against Computational Literary Studies.” Critical Inquiry, vol. 45, 2019, pp. 601-639.
Kirsch, Adam. “Technology Is Taking Over English Departments.” The New Republic, 2014, https://newrepublic.com/article/117428/limits-digital-humanities-adam-kirsch. Accessed 21 December 2020.
Marche, Stephen. “Literature Is not Data: Against Digital Humanities.” The Los Angeles Review of Books, 2012, https://lareviewofbooks.org/article/literature-is-not-data-against-digital-humanities. Accessed 21 December 2020.
14. The Basics (2/2)
• Data types: character, numeric, integer, logical, complex
• Data structures: vector, list, matrix, data frame, factors
• Keep notes using #
• Need help?
• ?____________
• help()
• install.packages("[name of package]")
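The basics above can be tried out in a few lines. This is only a minimal sketch: every name and value in it (title, year, issues, and so on) is made up for illustration and is not taken from the course materials.

```r
# Notes start with #; R ignores everything after it on the line.

# Data types (all values here are invented examples)
title  <- "Astounding Stories"   # character
year   <- 1930                   # numeric
issues <- 12L                    # integer (note the L suffix)
pulp   <- TRUE                   # logical
z      <- 1 + 2i                 # complex

class(title)    # "character"
class(issues)   # "integer"

# Data structures
months <- c("jan", "feb", "mar")            # vector: one data type only
mixed  <- list(title, issues, pulp)         # list: any mix of types
grid   <- matrix(1:6, nrow = 2)             # matrix: rows and columns, one data type only
tab    <- data.frame(month = months,
                     words = c(10, 20, 30)) # data frame: columns can differ in type
genre  <- factor(c("sf", "sf", "horror"))   # factor: categories

str(tab)        # compact summary of any object

# Need help? Run one of these in the RStudio console:
# ?class
# help(class)
```

Running class() or str() on an object is a quick way to check what you are working with before passing it to another function.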
15. Tools > Global Options > Appearance
(You will need to restart RStudio to apply these changes.)
16. Let’s clean some text!
17. You can use whatever corpus you’d like for this course.
However, I have prepared a corpus of twelve texts for you. You may download the corpus at http://tinyurl.com/n8texts.
This corpus includes six public domain texts comprising the first months of Astounding Stories of Super-Science (1930). A full corpus for the year is available at http://tinyurl.com/n8texts2, if you’d like to use it in your own time.
• astoundingjan1930: https://www.gutenberg.org/ebooks/41481
• astoundingfeb1930: https://www.gutenberg.org/ebooks/28617
• astoundingmar1930: https://www.gutenberg.org/ebooks/29607
• astoundingapr1930: https://www.gutenberg.org/ebooks/29390
• astoundingmay1930: https://www.gutenberg.org/ebooks/29809
• astoundingjun1930: https://www.gutenberg.org/ebooks/29848
• astoundingjul1930: https://www.gutenberg.org/ebooks/29198
• astoundingaug1930: https://www.gutenberg.org/ebooks/29768
• astoundingsep1930: https://www.gutenberg.org/ebooks/29255
• astoundingoct1930: https://www.gutenberg.org/ebooks/29882
• astoundingnov1930: https://www.gutenberg.org/ebooks/29919
• astoundingdec1930: https://www.gutenberg.org/ebooks/30691
18. First, set your working directory: Session > Set Working Directory > Choose Directory > [folder]
install.packages("tm")   # only needed once
library(tm)
getwd()   # check the working directory
texts <- Corpus(DirSource("[path to working directory]"))   # read every file in the folder
writeLines(as.character(texts[[4]]))   # print the fourth text
?tm_map
getTransformations()   # list the available cleaning functions
texts1 <- tm_map(texts, removePunctuation)
texts2 <- tm_map(texts1, removeNumbers)
texts3 <- tm_map(texts2, content_transformer(tolower))
texts4 <- tm_map(texts3, removeWords, stopwords("english"))
texts_final <- tm_map(texts4, stripWhitespace)
writeLines(as.character(texts_final[[4]]))   # compare with the uncleaned text
dtm <- DocumentTermMatrix(texts_final)
inspect(dtm)   # use inspect() to take a look!
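To see roughly what each cleaning step does, here is a minimal sketch that applies similar steps to a single made-up sentence using only base R. It is not how tm works internally: the sentence, the tiny stopword list, and the variable names are all assumptions for illustration, and tm's stopwords("english") list is much longer.

```r
# A made-up sentence standing in for one text in the corpus.
raw <- "The  Ray, in 1930, was NOT of this Earth!"

# A tiny hand-picked stopword list (an assumption, not tm's list).
stops <- c("the", "in", "was", "not", "of", "this")

step1 <- gsub("[[:punct:]]", "", raw)     # roughly what removePunctuation does
step2 <- gsub("[[:digit:]]", "", step1)   # roughly what removeNumbers does
step3 <- tolower(step2)                   # roughly what content_transformer(tolower) does

# Roughly what removeWords does: delete whole words found in the stopword list.
pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
step4 <- gsub(pattern, "", step3, perl = TRUE)

# Roughly what stripWhitespace does: collapse runs of spaces into one.
# trimws() also drops spaces at the start and end, which is an extra step.
cleaned <- trimws(gsub("\\s+", " ", step4))

print(cleaned)   # "ray earth"
```

Note that lowercasing happens before the stopwords are removed; otherwise "The" and "NOT" would not match the lowercase stopword list.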
20. Help me! (2/3)
R Resources
Matthew Jockers, Text Analysis with R for Students of Literature (New York: Springer, 2014)
https://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/
LinkedIn Learning: R: https://www.linkedin.com/learning/topics/r
Emmanuel Paradis, R for Beginners (2005): https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Emma Rand, ‘Reproducible Analyses in R’, N8 CIR (2020): https://n8cir.org.uk/events/event-resource/analyses-r
W. N. Venables, D. M. Smith, and the R Core Team, An Introduction to R (2021): https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
21. Help me! (3/3)
R Packages for Text Analysis
corpustools (tokenised text analysis): https://cran.r-project.org/web/packages/corpustools
gutenbergr (searching/downloading Project Gutenberg): https://cran.r-project.org/web/packages/gutenbergr
quanteda (quantitative text analysis): https://cran.r-project.org/web/packages/quanteda/index.html
stylo (stylometry): https://cran.r-project.org/web/packages/stylo
syuzhet (sentiment analysis): https://cran.r-project.org/web/packages/syuzhet/index.html
tidytext (a bit of everything!): https://cran.r-project.org/web/packages/tidytext
tm (text mining – what we’ve done here): https://cran.r-project.org/web/packages/tm/index.html
22. Session 2:
Charts, Clouds, and Confidence
23. Session 2 Agenda
1. Any questions from last week?
2. Review of last week’s session (i.e. cleaning text)
3. Counting words
4. Plotting results
5. Making word clouds
6. Wrapping up
24. First, set your working directory: Session > Set Working Directory > Choose Directory > [folder]
install.packages("tm")   # only needed once
library(tm)
getwd()   # check the working directory
texts <- Corpus(DirSource("[path to working directory]"))   # read every file in the folder
writeLines(as.character(texts[[4]]))   # print the fourth text
?tm_map
getTransformations()   # list the available cleaning functions
texts1 <- tm_map(texts, removePunctuation)
texts2 <- tm_map(texts1, removeNumbers)
texts3 <- tm_map(texts2, content_transformer(tolower))
texts4 <- tm_map(texts3, removeWords, stopwords("english"))
texts_final <- tm_map(texts4, stripWhitespace)
writeLines(as.character(texts_final[[4]]))   # compare with the uncleaned text
dtm <- DocumentTermMatrix(texts_final)
inspect(dtm)   # use inspect() to take a look!
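The bar chart slide uses an object called freq_d, which these slides do not define. One plausible reading, and it is only an assumption, is that freq_d holds the total count of each word, sorted from most to least frequent. The sketch below shows that counting step on a toy matrix with invented numbers; with the real corpus, the matrix would come from as.matrix(dtm) instead.

```r
# A toy document-term matrix: rows are documents, columns are words.
# The documents, words, and counts are invented for illustration.
m <- matrix(c(3, 0, 5,
              1, 2, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("ray", "moon", "ship")))

freq   <- colSums(m)                      # total count of each word across all documents
freq_d <- sort(freq, decreasing = TRUE)   # most frequent words first

freq_d[1:2]                               # the two most frequent words: ship (9) and ray (4)
```

Because freq_d is a named vector, barplot() can use the names as bar labels without any extra work.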
26. Making a bar chart (and then making it look nice):
# freq_d is not defined in these slides; it is assumed to be a named vector of word counts, sorted from most to least frequent
barplot(freq_d[1:10])
?barplot
install.packages("RColorBrewer")
library(RColorBrewer)
?RColorBrewer
display.brewer.all()
cols <- brewer.pal(8, "Paired")
barplot(freq_d[1:10], col=cols, main="My Cool Plot", xlab="Word", ylab="Instances")
27. Making a word cloud (and then making it look nice):
install.packages("wordcloud")
library(wordcloud)
dtm_matrix <- as.matrix(dtm)   # renamed to avoid confusion with the built-in matrix() function
wordbank <- sort(colSums(dtm_matrix), decreasing=TRUE)
df <- data.frame(words=names(wordbank), freq=wordbank)
?data.frame
?wordcloud
wordcloud(words=df$words, freq=df$freq, max.words=100, random.order=FALSE, colors=cols)   # cols comes from the bar chart slide
28. Discussion:
What are the potentials?
What are the limitations?
Is R the best choice?
32. Thank you!
Dr Leah Henrickson
Lecturer in Digital Media
School of Media and Communication
University of Leeds
L.R.Henrickson@leeds.ac.uk
twitter.com/leahhenrickson
Editor's Notes
Matrix = table
Data frame = table, but more flexible about what can be included in that table
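The difference between the two kinds of table can be seen in a short sketch. The words and counts below are invented for illustration.

```r
# A matrix is a table in which every cell has the same data type.
m <- matrix(c(3, 1, 0, 2), nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("ray", "moon")))

# Adding a column of text forces every cell to become text.
m2 <- cbind(m, title = c("a", "b"))
typeof(m2)        # "character"

# A data frame is a table in which each column can have its own data type.
# stringsAsFactors = FALSE keeps the words as plain text in older versions of R.
df <- data.frame(words = c("ray", "moon"),
                 freq  = c(3, 2),
                 stringsAsFactors = FALSE)
sapply(df, class) # words is "character", freq is "numeric"
```

This is why the word cloud code converts the counts into a data frame: the words (text) and their frequencies (numbers) can sit side by side.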