What is Text Analytics?
Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.
Text analytics processes can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated tools.
Why is Text Analytics important?
Emails, online reviews, tweets, call center agent notes, and the vast array of other written feedback all hold insight into customer wants and needs, but only if you can unlock it.
Text analytics is the way to extract meaning from this unstructured text and to uncover its patterns and themes.
Text Analytics in R
Text Analytics in R is carried out with the help of the tm package.
It is a framework for text mining applications within R.
It contains functions for actions such as content transformation, word removal, and finding frequent terms.
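A minimal sketch of getting tm ready (it is assumed the package first has to be installed from CRAN):
Code: # install.packages("tm")   # one-time install from CRAN
library(tm)   # provides tm_map(), removeWords(), findFreqTerms(), TermDocumentMatrix(), ...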
The Case Study data
The data used is a collection of game reviews stored in an Excel file.
Game reviews from 1000 gamers are recorded in the data set.
The objective is to analyse these reviews, treating all of them as one body of text, and find the most frequent words.
The reviews are read into a variable docs using the functions VectorSource() and Corpus().
VectorSource() defines the source, treating each element of the text vector as a separate document.
Corpus() builds the corpus, the structure that holds the collection of texts, from that source.
Reading the Data
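A minimal sketch of the reading step, assuming the reviews sit in a file named game_reviews.xlsx with a column called Review, read with the readxl package (the file name, column name and readxl are assumptions, not given in the case study):
Code: library(readxl)
library(tm)
reviews <- read_excel("game_reviews.xlsx")   # read the Excel file (assumed name)
text <- reviews$Review                       # assumed column holding the review text
docs <- Corpus(VectorSource(text))           # each review becomes one document in the corpus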
Data cleansing is required, as most of the reviews contain punctuation, numbers, stop words, etc. that we do not need for the analysis.
Depending on what you are trying to achieve with your analysis, you may want to do the data cleaning step differently.
Data cleansing is done using the tm_map() function in R.
Cleaning the Data
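A minimal sketch of the cleansing step with tm_map(); the exact set of transformations is a choice, but these cover the punctuation, numbers and stop words mentioned above:
Code: docs <- tm_map(docs, content_transformer(tolower))    # lower-case all text
docs <- tm_map(docs, removePunctuation)                      # strip punctuation
docs <- tm_map(docs, removeNumbers)                          # strip digits
docs <- tm_map(docs, removeWords, stopwords("english"))      # drop common English stop words
docs <- tm_map(docs, stripWhitespace)                        # collapse extra whitespace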
Converting the documents into a Document-Term Matrix
A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents: rows correspond to documents in the collection and columns correspond to terms. A term-document matrix, which is what we build below, is simply its transpose, with terms as rows and documents as columns.
The tm package stores document-term matrices as sparse matrices for efficiency.
Since we only have 1000 reviews, the resulting term-document matrix is small, so we can simply convert it into an ordinary (dense) matrix, which is easier to work with.
Code: dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
We then take the row sums of this matrix (each row of a term-document matrix is a term), which gives us a named vector of term frequencies.
And now we can sort this vector to see the most frequently used words.
Code: v <- sort(rowSums(m),decreasing=TRUE)
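To inspect the result, the sorted vector can be turned into a small frequency table (a sketch; the data frame name d is just illustrative):
Code: d <- data.frame(word = names(v), freq = v)   # named vector -> word/frequency table
head(d, 10)                                        # the ten most frequent words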
Finding the frequent terms and their frequencies
A graph can be built in which the frequent terms are the nodes and the frequencies define the strength of the interactions (the edge weights).
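A minimal sketch of such a term network using the igraph package (an assumption; the original does not name a graph library). The term-by-term co-occurrence matrix m %*% t(m) supplies the edge weights, and the frequency threshold of 50 is arbitrary:
Code: library(igraph)
freq_terms <- findFreqTerms(dtm, lowfreq = 50)            # keep only reasonably frequent terms
adj <- m[freq_terms, ] %*% t(m[freq_terms, ])             # term-by-term co-occurrence counts
g <- graph_from_adjacency_matrix(adj, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)
plot(simplify(g), vertex.size = 5, vertex.label.cex = 0.7)  # frequent terms as nodes, weights as edge strength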
In case of large networks
Say the network has more than 10K nodes. Such networks will be too large to visualize or interpret node by node.
To quantify such networks, we instead look at their statistical properties.
Random network, scale-free network, or hierarchical network models are a good fit in such cases.
(Figure: examples of a Random Network, a Scale-free Network, and a Hierarchical Network.)
Where else can network approaches be used?