This talk covers how to find latent topics in a collection of documents without any labels (unsupervised learning). It also covers Latent Dirichlet Allocation (LDA), a document clustering model. LDA can be used in multiple NLP pipelines, e.g., document clustering, topic evaluation, feature extraction, document similarity studies, and text summarisation. Evaluating the quality of results from such unsupervised models is a challenge; we will discuss a few effective evaluation methods.
2. Agenda
● Introduction to LDA
● Other Clustering Methods
● Model pipeline and Training
● Evaluate LDA model results
○ How to measure the quality of results
○ Evaluate the coherence of the topics
○ Cross-check that the patents in a cluster are similar
3. LDA: Find natural categories of millions of documents, and suggest a name for each category.
5. LDA - Latent Dirichlet Allocation
● A generative probabilistic model: it generates documents from topics, and topics from vocabulary terms.
● An unsupervised model.
● Other clustering algorithms: LSI, PLSI, and K-Means.
7. LSI
● Dimensionality reduction using truncated SVD.
● The document-term matrix D is N x V (N documents x V vocabulary terms).
● Truncated SVD factors D into an N x T and a T x V matrix, where T is the number of latent topics.
● It lacks interpretability of the topics.
● Representation quality isn't that good either.
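As a rough illustration, here is a minimal LSI sketch using scikit-learn's TruncatedSVD; the tiny corpus and T = 2 are purely illustrative:

```python
# Minimal LSI sketch: factor the document-term matrix D (N x V)
# into N x T and T x V with truncated SVD. Corpus is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["battery charging circuit", "neural network training",
        "charging station for electric vehicles"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)      # sparse N x V matrix

svd = TruncatedSVD(n_components=2)      # T = 2 latent "topics"
doc_topic = svd.fit_transform(D)        # N x T
topic_vocab = svd.components_           # T x V
print(doc_topic.shape, topic_vocab.shape)
```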
9. LDA Model
● Plate notation of LDA: a probabilistic graphical model.
● Uses Bayesian inference to find the best likelihood estimate.
● Uses Dirichlet priors for the topic and vocabulary distributions, hence the name LDA.
● Alpha and Beta are the Dirichlet priors.
● K topics
● N vocabulary terms
● M documents
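To make the generative story concrete, here is a toy numpy sketch of it; all sizes and priors are illustrative values, not tuned parameters:

```python
# Illustrative sketch of LDA's generative process with numpy.
import numpy as np

rng = np.random.default_rng(0)
K, V, M, doc_len = 3, 10, 2, 8      # topics, vocab size, docs, words/doc
alpha, beta = 0.1, 0.01             # Dirichlet priors (toy values)

phi = rng.dirichlet([beta] * V, size=K)   # K x V: topic -> word dists
for _ in range(M):
    theta = rng.dirichlet([alpha] * K)    # per-document topic mixture
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)        # sample a topic
        w = rng.choice(V, p=phi[z])       # sample a word from that topic
        doc.append(w)
    print(doc)
```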
10. K-Means clustering
● K-Means applied on top of the Document x Topic matrix (sketched below).
● After the patents are grouped by spatial proximity, we can assign a topic number to each cluster based on the patents already in it.
● LDA acts as a dimensionality reduction from the sparse Document x Vocab matrix to the dense Document x Topic matrix.
● K-Means does a good job on dense vectors.
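A minimal sketch of this step with scikit-learn's KMeans; the doc_topic values are toy numbers standing in for a real LDA output:

```python
# Sketch: cluster the dense Document x Topic matrix with K-Means.
import numpy as np
from sklearn.cluster import KMeans

doc_topic = np.array([[0.90, 0.05, 0.05],
                      [0.80, 0.10, 0.10],
                      [0.10, 0.10, 0.80],
                      [0.05, 0.05, 0.90]])   # 4 docs x 3 topics (toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_topic)
print(km.labels_)   # cluster id per document, e.g. [0 0 1 1]
```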
12. Feature Engineering
● Tokenization and text cleanup
● Apply standard and custom stopword filtering
● Noun-chunk extraction using spaCy- or NLTK-based taggers
● N-gram features
○ If a lot of data is available, unigrams alone give pretty good results.
● Stemming / lemmatization
● TF-IDF based feature selection
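A small sketch of these steps with spaCy, assuming the en_core_web_sm model is installed; the custom stopword list is hypothetical:

```python
# Sketch: tokenize, lemmatize, filter stopwords, extract noun chunks.
import spacy

nlp = spacy.load("en_core_web_sm")       # assumes this model is installed
custom_stopwords = {"method", "apparatus"}   # hypothetical domain noise

def featurize(text):
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc
              if t.is_alpha and not t.is_stop
              and t.lemma_.lower() not in custom_stopwords]
    noun_chunks = [c.text.lower() for c in doc.noun_chunks]
    return tokens, noun_chunks

print(featurize("A method for charging the battery of an electric vehicle."))
```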
15. Tech stack
● Developed on Spark MLlib (or use gensim if the dataset is smaller)
● Has to handle millions of documents
● We use a cluster with 300 GB RAM and 50 CPU cores
● S3 to persist the data
● Pre- and post-processing pipelines
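A minimal local sketch of training LDA with Spark MLlib; the toy data and parameter values are illustrative, not the production setup:

```python
# Sketch: count-vectorize tokens and fit Spark MLlib's LDA locally.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-demo").getOrCreate()
df = spark.createDataFrame(
    [(["battery", "charging", "circuit"],),
     (["neural", "network", "training"],)],
    ["tokens"])

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10000)
vectors = cv.fit(df).transform(df)

lda = LDA(k=2, maxIter=20, optimizer="online")
model = lda.fit(vectors)
model.describeTopics(5).show()   # top 5 terms per topic
```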
16. Hyperparameters
● Doc concentration prior (alpha)
● Topic concentration prior (beta)
● Number of topics (K)
● Iterations
● Vocab size / feature size (N), in bag-of-words format
● max-df tuning
● Custom stopwords to further prune noisy vocabulary
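These map onto Spark MLlib roughly as below; a sketch with illustrative values (maxDF on CountVectorizer assumes a reasonably recent Spark version):

```python
# Sketch: wiring the hyperparameters above into Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()

cv = CountVectorizer(inputCol="tokens", outputCol="features",
                     vocabSize=50000,       # N: BOW feature size
                     maxDF=0.5)             # max-df: drop very common terms
lda = LDA(k=100,                            # K: number of topics
          maxIter=100,                      # iterations
          docConcentration=[0.1],           # alpha (illustrative value)
          topicConcentration=0.01)          # beta (illustrative value)
```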
18. Challenges in model evaluation
● LDA is an unsupervised model; how do we cross-check convergence?
● Test-set validation?
● What measure do we use for grid search?
● How do we compare two LDA runs?
● We want to avoid the human bias involved in comparing topics.
19. Model Evaluation Methods
● Perplexity: ensure the log-likelihood function is at its maximum, which drives perplexity down.
● Plot the sum of the probabilities of the top 10 words from the Topic x Vocab matrix (see the sketch after this list).
● Topic coherence evaluation
● Topic dependency score
● Manual evaluation framework
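A quick sketch of the top-10 probability-mass check, using a random toy Topic x Vocab matrix in place of a trained model; a well-converged topic concentrates more mass in its head:

```python
# Sketch: per-topic sum of the 10 highest word probabilities.
import numpy as np

rng = np.random.default_rng(0)
topic_vocab = rng.dirichlet([0.01] * 1000, size=20)   # 20 topics x 1000 words

top10_mass = np.sort(topic_vocab, axis=1)[:, -10:].sum(axis=1)
for k, mass in enumerate(top10_mass):
    print(f"topic {k}: top-10 mass = {mass:.2f}")     # plot these per run
```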
20. Perplexity
● A measure of whether a probabilistic model's likelihood function has reached its maximum.
● Applied on a held-out or test dataset.
● Used to tune one parameter while keeping the others constant, similar to elbow-point identification in K-Means.
● Perplexity doesn't measure contextual information between words; it works at the per-word level.
● So it's not directly usable as the final model evaluation metric, but we can use it to tune the model's hyperparameters.
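A sketch of held-out perplexity with gensim on placeholder token lists; gensim's log_perplexity returns a per-word log2 bound, so perplexity is 2 to the negative bound:

```python
# Sketch: perplexity on a held-out corpus with gensim.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_texts = [["battery", "charging"], ["network", "training"]]
heldout_texts = [["battery", "circuit"]]

dictionary = Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = LdaModel(train_corpus, id2word=dictionary, num_topics=2, passes=5)
bound = lda.log_perplexity(heldout_corpus)   # per-word log2 likelihood bound
print("perplexity:", np.exp2(-bound))        # lower is better
```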
23. Coherence Scores
● The method that matches manual verification most closely.
● Weighs whether the words' co-occurrences are really present in the documents.
● We can control the context window: full document, paragraph, or sentence level.
● A custom sliding window can also be applied.
● The gensim library provides off-the-shelf implementations of the standard coherence scores.
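A self-contained sketch with gensim's CoherenceModel on toy texts; swap in your trained model and corpus:

```python
# Sketch: compute a sliding-window coherence score with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["battery", "charging", "circuit"],
         ["neural", "network", "training"],
         ["battery", "circuit", "design"],
         ["training", "deep", "network"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=5)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence="c_v")     # or "u_mass", "c_npmi", ...
print("c_v coherence:", cm.get_coherence())
```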
26. Coherence scores are used to compare models - UMass
● LDA Run 1: -5.403614
● LDA Run 2: -2.780710
● LDA Run 3: -3.300038
● The higher (less negative) the score, the better; here Run 2 is the best.
27. Topic dependency - Jaccard Distance
● Find how close or distant the topics are.
● Helpful for knowing whether your topics are highly overlapping or specific in nature.
● It's very easy to calculate using the top N words from each topic-vocab distribution.
● The median overlap score can be used as the optimisation target for grid search.
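A sketch of the pairwise Jaccard computation; the top-word sets are made-up examples:

```python
# Sketch: pairwise Jaccard distance between topics' top-N word sets.
from itertools import combinations
import statistics

topics = [{"battery", "charge", "circuit", "power"},
          {"battery", "cell", "lithium", "charge"},
          {"network", "layer", "training", "model"}]

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

distances = [jaccard_distance(a, b) for a, b in combinations(topics, 2)]
print("median distance:", statistics.median(distances))  # grid-search target
```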
28. Grid search for the best parameters
● Make use of LDADE.
● A differential-evolution method to optimise any black-box function.
● Best fit if you are training on a small dataset, since finding a good parameter set takes hundreds of training runs; otherwise you need a big cluster to reduce the training time.
● LDADE reduces the overall search space, but the number of runs is still not very low.
● A rule of thumb: if your model trains within a few minutes, it's ideal.
● Topic variance between two runs is used as the loss function.
● Reference: https://labs.imaginea.com/reference/lda-tuning/
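A sketch of the idea with scipy's differential_evolution; train_and_score is a hypothetical stand-in for "train LDA, return a loss":

```python
# Sketch: treat "train LDA, score it" as a black-box function of
# (k, alpha, beta) and optimise it with differential evolution.
from scipy.optimize import differential_evolution

def train_and_score(params):
    k, alpha, beta = int(params[0]), params[1], params[2]
    # ... train LDA with (k, alpha, beta), return a loss such as
    # negative coherence or topic variance between two runs ...
    return (k - 50) ** 2 + alpha + beta   # dummy loss for illustration

bounds = [(10, 200), (0.01, 1.0), (0.01, 1.0)]   # k, alpha, beta
result = differential_evolution(train_and_score, bounds, maxiter=10, seed=0)
print(result.x)   # best (k, alpha, beta) found
```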
29. Summary
● LDA has been used to find latent topics in documents.
● LDA converges well and accumulates good words for each topic, describing it well.
● It can be used for feature extraction from a document.
● Model evaluation is the difficult part; use coherence scores along with other measures.