
Text Analytics




  1. Presented by: Ajay Ram K P
  2. What is Text Analytics?
     • Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.
     • Text analytics can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated solutions.
  3. Why is Text Analytics important?
     • Emails, online reviews, tweets, call-center agent notes, and the vast array of other written feedback all hold insight into customer wants and needs, but only if you can unlock it.
     • Text analytics is the way to extract meaning from this unstructured text and to uncover patterns and themes.
  4. Text Analytics in R
     • Text analytics in R is carried out with the help of the tm package.
     • tm is a framework for text-mining applications within R.
     • It contains functions for actions such as content transformation, word removal, finding frequent terms, and a lot more.
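As a minimal sketch, the tm package can be installed from CRAN and loaded as usual; its built-in content transformations (the word-removal and cleaning actions mentioned above) can be listed with `getTransformations()`:

```r
# install.packages("tm")  # once, from CRAN
library(tm)

# List tm's built-in transformations (removeNumbers, removePunctuation,
# removeWords, stemDocument, stripWhitespace, ...)
getTransformations()
```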
  5. The Case Study Data
     • The data is a collection of game reviews stored in an Excel sheet.
     • Reviews from 1,000 gamers are recorded in the data set.
     • The objective is to analyze these reviews, treating all of them as one text, and find the most frequent words.
  6. Part 1: Reading the Data
     • The reviews are read into a variable docs using the VectorSource() and Corpus() functions.
     • VectorSource() interprets each element of a character vector as a separate document.
     • Corpus() builds the corpus object that the tm functions operate on.
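A sketch of this reading step is below. The workbook name `game_reviews.xlsx` and column name `review` are assumptions (the deck does not show the actual file); a small in-line sample stands in for it here:

```r
library(tm)
# library(readxl)  # assumed: the reviews live in an Excel workbook, e.g.:
# reviews <- readxl::read_excel("game_reviews.xlsx")$review  # hypothetical names

# Small stand-in sample in place of the 1,000-review sheet:
reviews <- c("Great game, loved the graphics!",
             "Boring story, would not recommend.")

# VectorSource() treats each element of the vector as one document;
# Corpus() builds the corpus object that tm functions operate on.
docs <- Corpus(VectorSource(reviews))
length(docs)
```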
  7. Cleaning the Data
     • Data cleansing is required because most of the reviews contain punctuation, numbers, stop words, etc. that we don't need for the analysis.
     • Depending on what you are trying to achieve with your analysis, you may want to do the data-cleaning step differently.
     • Data cleansing is done using the tm_map() function in R.
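A typical tm_map() cleaning pipeline looks like the sketch below (on a stand-in corpus); which steps you keep depends on the analysis, as noted above:

```r
library(tm)

docs <- Corpus(VectorSource(c("Great game!!", "The graphics are BAD, 2/10")))

# Common cleaning steps via tm_map(); pick the ones your analysis needs.
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces

as.character(docs[[1]])
```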
  8. Finding the Frequent Terms and Their Frequency
     • Converting the corpus into a term-document matrix.
     • A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
     • The tm package stores term-document matrices as sparse matrices for efficiency. Since we only have 1,000 reviews treated as one collection, we can just convert our term-document matrix into a normal matrix, which is easier to work with.
     Code:
     dtm <- TermDocumentMatrix(docs)
     m <- as.matrix(dtm)
     • Because TermDocumentMatrix() puts terms in the rows, we then take the row sums of this matrix, which gives us a named vector.
     • Now we can sort this vector to see the most frequently used words.
     Code:
     v <- sort(rowSums(m), decreasing = TRUE)
     head(v)
  9. [screenshot]
  10. Plotting the Word Cloud
     • For plotting the word cloud, we use the wordcloud package.
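The deck does not show the wordcloud call itself, so the following is a sketch on a stand-in corpus, feeding the frequency vector `v` from the previous step into `wordcloud()`:

```r
library(tm)
library(wordcloud)  # depends on RColorBrewer for the palettes

docs <- Corpus(VectorSource(c("great game great fun", "great graphics fun")))
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)  # named vector of term frequencies

# Draw the cloud: word size is proportional to frequency.
wordcloud(words = names(v), freq = v, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```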
  11. And voilà!
  12. Part 2: Creating the Network
     • For network creation, we take the help of the packages:
       • igraph
       • sna
       • network
  13. Creating the Network
     • Finding the associations between terms.
     • The findAssocs() function is used.
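findAssocs() returns the terms whose document-level occurrence correlates with a given term above a threshold. A sketch on a stand-in corpus (the deck does not show the actual term or threshold used):

```r
library(tm)

docs <- Corpus(VectorSource(c("great game fun", "great game play",
                              "boring story", "boring slow story")))
dtm <- TermDocumentMatrix(docs)

# Terms whose occurrence pattern correlates with "great" at >= 0.8
assocs <- findAssocs(dtm, "great", 0.8)
assocs
```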
  14. Creating the Network
     • Plotting the graph.
     • Using the igraph package and its functions.
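One common way to build such a term network (a sketch, not necessarily the exact construction used in the deck) is to form a term-term co-occurrence matrix from the term-document matrix and pass it to igraph:

```r
library(tm)
library(igraph)

docs <- Corpus(VectorSource(c("great game fun", "great game play",
                              "boring story")))
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)

# Term-term adjacency: co-occurrence counts across documents.
adj <- m %*% t(m)
g <- graph_from_adjacency_matrix(adj, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)
plot(g, vertex.label = V(g)$name)
```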
  15. And there it is!
  16. Another Graph
     • A graph where the frequent terms are the nodes and their frequencies give the interaction strength.
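One way to realize this variant (a sketch under the same co-occurrence construction as before) is to scale node size by term frequency and edge width by co-occurrence strength:

```r
library(tm)
library(igraph)

docs <- Corpus(VectorSource(c("great game fun", "great game play",
                              "boring story")))
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
freq <- rowSums(m)  # term frequencies, in the same order as the rows of m

adj <- m %*% t(m)
g <- graph_from_adjacency_matrix(adj, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)

# Node size reflects how frequent a term is; edge width its co-occurrence count.
plot(g, vertex.size = 15 * freq / max(freq), edge.width = E(g)$weight)
```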
  17. In Case of Large Networks
     • Say the network has more than 10,000 nodes; such networks are too complicated to visualize directly.
     • To quantify such networks, we turn to their statistical properties.
     • Random-network, scale-free-network, or hierarchical-network models would be a fit in such cases.
     (Images: Random Network, Scale-free Network, Hierarchical Network)
  18. Where else can network approaches be powerful?
     • Biological science
     • Economics
     • Computer science
  19. Thank you!