R in the Humanities: Text Analysis and Visualization

R in the Humanities: Text Analysis (2022)
Dr Leah Henrickson
Lecturer in Digital Media
School of Media and Communication
University of Leeds
L.R.Henrickson@leeds.ac.uk
twitter.com/leahhenrickson

Who am I?
• Lecturer in Digital Media
• Programme Leader, MA New Media
• Book historian
• Digital humanist
• Canadian 🍁

Publication in the next issue of Victorian Review: ‘Tangling and Untangling the Trollopes’, with Eleanor Dumbill

Session 1:
Gettin’ to Grips with R
CC Image: https://www.pexels.com/photo/smiling-model-in-pirate-costume-with-smoking-pipe-7000092

Overview
This course is a gentle introduction to R for text analysis. Over two sessions, you will learn the basics of this powerful programming language and then gain hands-on experience analysing long-form text in the RStudio development environment.
By the end of the course, you will be able to:
• Navigate the RStudio development environment
• Prepare long-form prose texts for computational analysis using R
• Conduct basic computational analyses of long-form prose texts
• Construct and explain visualisations of computed results
• Critically apply computational text analysis to complement other analytical methods
To complete this course you will need to install:
• R version 3.6 or higher (download at https://www.r-project.org)
• RStudio Desktop: Open Source Edition 1.2 or higher (download at https://www.rstudio.com/products/rstudio)

Session 1 Agenda
1. What are R and RStudio?
2. What can R help you do?
3. A quick note about Computational Literary Studies
4. Getting started with R
5. Cleaning text
CC Image: https://www.pexels.com/photo/black-cat-holding-persons-arm-1049764

What are R and RStudio?
R is:
• a programming language
• a software environment
• a really fancy calculator
• free/open source
Download: https://cran.r-project.org/mirrors.html
RStudio is:
• an integrated development environment (IDE)
• a great way to make your coding experiences easier, more colourful, and more fun!
Download: https://www.rstudio.com/products/rstudio/download

What can R help you do?
• Count words
• Find linguistic patterns within and across texts
• Compare texts
• Make pretty pictures
But it’s still up to you to explain results.
Also, is R always the most appropriate tool?
CC Image: https://pixabay.com/photos/letters-tiles-word-game-crossword-4938486

A quick note about Computational Literary Studies (CLS)
CLS has a long history (for example, Father Roberto Busa, ~1940s), but has been criticised for:
• Misinterpretation of statistical data (Da)
• Unchecked enthusiasm for technological ‘hype’ (Kirsch)
• Turning literature into data and neglecting reception of works (Marche)
Da, Nan Z. “The Computational Case against Computational Literary Studies.” Critical Inquiry, vol. 45, 2019, pp. 601-639.
Kirsch, Adam. “Technology Is Taking Over English Departments.” The New Republic, 2014, https://newrepublic.com/article/117428/limits-digital-humanities-adam-kirsch. Accessed 21 December 2020.
Marche, Stephen. “Literature Is not Data: Against Digital Humanities.” The Los Angeles Review of Books, 2012, https://lareviewofbooks.org/article/literature-is-not-data-against-digital-humanities. Accessed 21 December 2020.
CC Image: https://melissaterras.org/2013/10/15/for-ada-lovelace-day-father-busas-female-punch-card-operatives

Source (write your script)
Console (run your script)
Environment (your data)
Everything else!

The Basics (1/2)
Calculating
• 10 + 2 (spaces optional)
• 10 - 2
• 10 * 2
• 10 / 2
Strings and Things
• 1:50
• print("Hello world!")
• [variable name] <- c(1, 2, 3)
• [variable name][2]
Meme: https://knowyourmeme.com/memes/math-lady-confused-lady
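The commands on this slide can be tried together in the Console. This is a minimal sketch: the names `one_to_fifty`, `my_numbers`, and `second_number` are hypothetical stand-ins for `[variable name]`.

```r
# Arithmetic: R evaluates each expression and prints the result
10 + 2
10 - 2
10 * 2
10 / 2

# The colon operator builds a sequence of whole numbers
one_to_fifty <- 1:50

# Text (a "string") must be wrapped in straight quotation marks
print("Hello world!")

# c() combines values into a vector; my_numbers is a hypothetical name
my_numbers <- c(1, 2, 3)

# Square brackets pull out an element by position (R counts from 1)
second_number <- my_numbers[2]
second_number
```

Note that curly quotation marks pasted from slides or word processors will cause an error; R only accepts straight ones.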

The Basics (2/2)
• Data types: character, numeric, integer, logical, complex
• Data structures: vector, list, matrix, data frame, factor
• Keep notes using #
• Need help?
• ?____________
• help()
• install.packages("[name of package]")
Meme: https://www.reddit.com/r/ProgrammerHumor/comments/8w54mx/code_comments_be_like
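The data types and structures listed above can be explored with a few lines of base R. This is a rough sketch: every value and variable name here is made up for illustration.

```r
# class() reports the data type of a value
class("Astounding Stories")  # character
class(3.14)                  # numeric
class(2L)                    # integer (the L suffix marks a whole number)
class(TRUE)                  # logical
class(1 + 2i)                # complex

# Vector: an ordered set of values that all share one type
issues <- c("jan", "feb", "mar")

# List: an ordered set of values that may mix types
issue_info <- list(title = "Astounding Stories", year = 1930, public_domain = TRUE)

# Matrix: a grid of values of one type (here 2 rows and 3 columns)
counts <- matrix(1:6, nrow = 2)

# Data frame: a table whose columns may have different types
issue_table <- data.frame(month = issues, number = c(1, 2, 3))

# Factor: a vector of categories with a fixed set of levels
month_factor <- factor(c("jan", "feb", "jan"))
levels(month_factor)

# Getting help on any function
?class
help(class)
```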

Tools > Global Options > Appearance
(You will need to restart RStudio to apply these changes.)

Let’s clean some text!
CC Image: https://thenounproject.com/term/cleaning/199037

You can use whatever corpus you’d like for this course.
However, I have prepared a corpus of six texts for you. You may download the corpus at http://tinyurl.com/n8texts.
This corpus includes six public domain texts comprising the first six months of Astounding Stories of Super-Science (1930). A full corpus for the year is available at http://tinyurl.com/n8texts2, if you’d like to use it in your own time.
• astoundingjan1930: https://www.gutenberg.org/ebooks/41481
• astoundingfeb1930: https://www.gutenberg.org/ebooks/28617
• astoundingmar1930: https://www.gutenberg.org/ebooks/29607
• astoundingapr1930: https://www.gutenberg.org/ebooks/29390
• astoundingmay1930: https://www.gutenberg.org/ebooks/29809
• astoundingjun1930: https://www.gutenberg.org/ebooks/29848
• astoundingjul1930: https://www.gutenberg.org/ebooks/29198
• astoundingaug1930: https://www.gutenberg.org/ebooks/29768
• astoundingsep1930: https://www.gutenberg.org/ebooks/29255
• astoundingoct1930: https://www.gutenberg.org/ebooks/29882
• astoundingnov1930: https://www.gutenberg.org/ebooks/29919
• astoundingdec1930: https://www.gutenberg.org/ebooks/30691

First, set your working directory: Session > Set Working Directory > Choose Directory > [folder]
install.packages("tm")  # only needed once per computer
library(tm)
getwd()  # confirm the working directory
texts <- Corpus(DirSource("[path to working directory]"))
writeLines(as.character(texts[[4]]))  # view the fourth text
?tm_map
getTransformations()  # list the available transformations
texts1 <- tm_map(texts, removePunctuation)
texts2 <- tm_map(texts1, removeNumbers)
texts3 <- tm_map(texts2, content_transformer(tolower))
texts4 <- tm_map(texts3, removeWords, stopwords("english"))
texts_final <- tm_map(texts4, stripWhitespace)
writeLines(as.character(texts_final[[4]]))  # view the cleaned fourth text
dtm <- DocumentTermMatrix(texts_final)
inspect(dtm)  # take a look!
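To see what each cleaning step does, it can help to apply rough equivalents to a single sentence. The sketch below uses only base R rather than tm, so it approximates the transformations rather than reproducing tm's own implementation. The sentence and the three-word stop list are made up for illustration; tm's `stopwords("english")` list is much longer.

```r
# A made-up sentence standing in for one text in the corpus
raw <- "The ROCKET left   Earth in 1930, and the crew cheered!"

# Roughly what removePunctuation does: delete punctuation characters
no_punct <- gsub("[[:punct:]]", "", raw)

# Roughly what removeNumbers does: delete digits
no_nums <- gsub("[[:digit:]]", "", no_punct)

# What content_transformer(tolower) does: convert to lower case
lower <- tolower(no_nums)

# Roughly what removeWords does: delete whole words found in a stop list
stop_list <- c("the", "in", "and")  # hypothetical, deliberately tiny list
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")
no_stops <- gsub(pattern, "", lower, perl = TRUE)

# Roughly what stripWhitespace does: collapse runs of spaces into one;
# trimws() additionally removes spaces at the start and end
clean <- trimws(gsub("[[:space:]]+", " ", no_stops))
clean
```

The order matters: lower-casing before removing stop words means that "The" and "the" are both caught by the same stop list entry.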

Help me! (1/3)
R Communities
#rstats (Twitter): https://twitter.com/hashtag/rstats
Forwards: https://forwards.github.io
R-Bloggers: https://www.r-bloggers.com
R-Ladies: https://rladies.org
r/rstats: https://www.reddit.com/r/rstats
RStudio Community: https://community.rstudio.com
Stack Overflow: https://stackoverflow.com/questions/tagged/r

Help me! (2/3)
R Resources
Matthew Jockers, Text Analysis with R for Students of Literature (New York: Springer, 2014)
https://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/
LinkedIn Learning: R: https://www.linkedin.com/learning/topics/r
Emmanuel Paradis, R for Beginners (2005): https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Emma Rand, ‘Reproducible Analyses in R’, N8 CIR (2020): https://n8cir.org.uk/events/event-resource/analyses-r
W. N. Venables, D. M. Smith, and the R Core Team, An Introduction to R (2021): https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf

Help me! (3/3)
R Packages for Text Analysis
corpustools (tokenised text analysis): https://cran.r-project.org/web/packages/corpustools
gutenbergr (searching/downloading Project Gutenberg): https://cran.r-project.org/web/packages/gutenbergr
quanteda (quantitative text analysis): https://cran.r-project.org/web/packages/quanteda/index.html
stylo (stylometry): https://cran.r-project.org/web/packages/stylo
syuzhet (sentiment analysis): https://cran.r-project.org/web/packages/syuzhet/index.html
tidytext (a bit of everything!): https://cran.r-project.org/web/packages/tidytext
tm (text mining – what we’ve done here): https://cran.r-project.org/web/packages/tm/index.html

Session 2:
Charts, Clouds, and Confidence
Image: https://pixabay.com/illustrations/rainbow-cloud-sunset-colorful-sky-5389074/

Session 2 Agenda
1. Any questions from last week?
2. Review of last week’s session (i.e. cleaning text)
3. Counting words
4. Plotting results
5. Making word clouds
6. Wrapping up
CC Images: https://thenounproject.com/term/graph/21394; https://thenounproject.com/term/word-cloud/195993

Getting word frequencies and associations:
freq <- colSums(as.matrix(dtm))
freq[1:10]
freq_d <- sort(freq, decreasing=TRUE)
freq_d[1:10]
findFreqTerms(dtm, lowfreq=100)
findAssocs(dtm, "man", 0.95)
?findAssocs
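The idea behind these commands can be sketched on a tiny hand-built matrix using base R only. The three "documents" below are made up, and `toy_dtm` is a plain matrix rather than a tm object: rows are documents, columns are terms, and cells are counts. The last line assumes that `findAssocs()` reports correlations between term columns, as its help page describes.

```r
# Three made-up, already-cleaned documents (hypothetical data)
docs <- c(doc1 = "man ship man planet",
          doc2 = "ship planet planet",
          doc3 = "man ship man man")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every document:
# rows = documents, columns = terms, cells = counts
toy_dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
toy_dtm

# Column sums give the total frequency of each word across the corpus
freq <- colSums(toy_dtm)
freq_d <- sort(freq, decreasing = TRUE)
freq_d

# Correlation between two term columns across documents
cor(toy_dtm[, "man"], toy_dtm[, "planet"])
```

A correlation close to 1 means two words tend to rise and fall together across documents; a negative value means one tends to appear where the other does not.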

Making a bar chart (and then making it look nice):
barplot(freq_d[1:10])
?barplot
install.packages("RColorBrewer")
library(RColorBrewer)
?RColorBrewer
display.brewer.all()
cols <- brewer.pal(8, "Paired")
barplot(freq_d[1:10], col=cols, main="My Cool Plot", xlab="Word", ylab="Instances")
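If RColorBrewer is not installed, a similar chart can be drawn with a palette built into base R. This is a sketch under stated assumptions: the counts in `toy_freq` are made up to stand in for the top of `freq_d`, and `hcl.colors()` requires R 3.6 or higher.

```r
# Made-up word counts standing in for the top of freq_d (hypothetical values)
toy_freq <- c(man = 50, ship = 42, time = 37, space = 30, earth = 21)

# One colour per bar, from a palette built into base R
bar_cols <- hcl.colors(length(toy_freq), palette = "Dark 3")

# barplot() uses the names of the vector as bar labels;
# las = 2 turns the labels sideways so that longer words fit
bar_midpoints <- barplot(toy_freq,
                         col = bar_cols,
                         main = "Top Words (toy data)",
                         xlab = "Word",
                         ylab = "Instances",
                         las = 2)
```

When there are more bars than colours (for example, ten bars with an eight-colour palette), barplot() reuses colours from the start of the palette.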

Making a word cloud (and then making it look nice):
install.packages("wordcloud")
library(wordcloud)
dtm_matrix <- as.matrix(dtm)  # avoid reusing the name of the built-in matrix() function
wordbank <- sort(colSums(dtm_matrix), decreasing=TRUE)
df <- data.frame(words=names(wordbank), freq=wordbank)
?data.frame
?wordcloud
wordcloud(words=df$words, freq=df$freq, max.words=100, random.order=FALSE, colors=cols)
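The data frame that feeds the word cloud can be checked on toy data before running it on a full corpus. In this sketch the counts are made up and the two colour names are arbitrary choices; the plotting call only runs if the wordcloud package is already installed.

```r
# Made-up word counts standing in for colSums(as.matrix(dtm)) (hypothetical values)
toy_counts <- c(ship = 42, man = 50, earth = 21, space = 30)

# Sort so that the most frequent word comes first
wordbank <- sort(toy_counts, decreasing = TRUE)

# One row per word: the word itself and how often it occurs
df <- data.frame(words = names(wordbank),
                 freq = unname(wordbank),
                 stringsAsFactors = FALSE)
df

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1930)  # word placement is random; a seed makes the layout repeatable
  wordcloud::wordcloud(words = df$words, freq = df$freq, min.freq = 1,
                       random.order = FALSE,
                       colors = c("steelblue", "firebrick"))
}
```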

Discussion:
What are the potentials?
What are the limitations?
Is R the best choice?
CC Image: https://www.pexels.com/photo/selective-focus-photography-of-traffic-light-1616781

Thank you!
Dr Leah Henrickson
Lecturer in Digital Media
School of Media and Communication
University of Leeds
