This talk covers how to find latent topics in a collection of documents without any labels (unsupervised learning). It also covers Latent Dirichlet Allocation (LDA), a document clustering model. LDA can be used in multiple NLP pipelines, e.g., document clustering, topic evaluation, feature extraction, document similarity studies, and text summarisation. Evaluating the quality of results from such unsupervised models is a challenge; we will discuss a few effective evaluation methods.
2. Agenda
● Introduction to LDA
● Other Clustering Methods
● Model pipeline and Training
● Evaluate LDA model results
○ How to measure the quality of results
○ Evaluate the coherence of the topics
○ Cross-check that the patents in a cluster are similar
3. LDA: Find natural categories of millions of documents, and suggest a name for each category.
5. LDA - Latent Dirichlet Allocation
● A generative probabilistic model: it generates documents from topics, and topics from vocabulary terms.
● An unsupervised model.
● Other clustering algorithms: LSI, PLSI, and K-Means.
7. LSI
● Dimensionality reduction using truncated SVD.
● The document-term matrix D is N x V (N documents x V vocabulary terms).
● Truncated SVD factors D into an N x T and a T x V matrix, where T is the number of latent topics.
● It lacks interpretability of the topics.
● Representation quality isn't that good either.
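As a rough illustration, here is a minimal LSI sketch using scikit-learn's TruncatedSVD; the tiny corpus and T = 2 are purely illustrative:

```python
# Minimal LSI sketch: factor the document-term matrix D (N x V)
# into N x T and T x V with truncated SVD. Corpus is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["battery charging circuit", "neural network training",
        "charging station for electric vehicles"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)      # sparse N x V matrix

svd = TruncatedSVD(n_components=2)      # T = 2 latent "topics"
doc_topic = svd.fit_transform(D)        # N x T
topic_vocab = svd.components_           # T x V
print(doc_topic.shape, topic_vocab.shape)
```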
9. LDA Model
● Plate notation of LDA: a probabilistic graphical model.
● Uses Bayesian inference to find the best likelihood estimate.
● Uses Dirichlet priors for the topic and vocabulary distributions, hence the name LDA.
● Alpha and Beta are the Dirichlet priors.
● K topics
● N vocabulary terms
● M documents
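To make the generative story concrete, here is a toy numpy sketch of it; all sizes and priors are illustrative values, not tuned parameters:

```python
# Illustrative sketch of LDA's generative process with numpy.
import numpy as np

rng = np.random.default_rng(0)
K, V, M, doc_len = 3, 10, 2, 8      # topics, vocab size, docs, words/doc
alpha, beta = 0.1, 0.01             # Dirichlet priors (toy values)

phi = rng.dirichlet([beta] * V, size=K)   # K x V: topic -> word dists
for _ in range(M):
    theta = rng.dirichlet([alpha] * K)    # per-document topic mixture
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)        # sample a topic
        w = rng.choice(V, p=phi[z])       # sample a word from that topic
        doc.append(w)
    print(doc)
```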
10. K-Means clustering
● K-Means applied on top of the Document x Topic matrix (sketched below).
● After the patents are grouped by spatial proximity, we can assign a topic number to each cluster based on the patents already in it.
● LDA acts as a dimensionality reduction from the sparse Document x Vocab matrix to the dense Document x Topic matrix.
● K-Means does a good job on dense vectors.
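A minimal sketch of this step with scikit-learn's KMeans; the doc_topic values are toy numbers standing in for a real LDA output:

```python
# Sketch: cluster the dense Document x Topic matrix with K-Means.
import numpy as np
from sklearn.cluster import KMeans

doc_topic = np.array([[0.90, 0.05, 0.05],
                      [0.80, 0.10, 0.10],
                      [0.10, 0.10, 0.80],
                      [0.05, 0.05, 0.90]])   # 4 docs x 3 topics (toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_topic)
print(km.labels_)   # cluster id per document, e.g. [0 0 1 1]
```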
12. Feature Engineering
● Tokenization and text cleanup
● Apply standard and custom stopword filtering
● Noun-chunk extraction using spaCy- or NLTK-based taggers
● N-gram features
○ If a lot of data is available, unigrams alone give pretty good results.
● Stemming / lemmatization
● TF-IDF based feature selection
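A small sketch of these steps with spaCy, assuming the en_core_web_sm model is installed; the custom stopword list is hypothetical:

```python
# Sketch: tokenize, lemmatize, filter stopwords, extract noun chunks.
import spacy

nlp = spacy.load("en_core_web_sm")       # assumes this model is installed
custom_stopwords = {"method", "apparatus"}   # hypothetical domain noise

def featurize(text):
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc
              if t.is_alpha and not t.is_stop
              and t.lemma_.lower() not in custom_stopwords]
    noun_chunks = [c.text.lower() for c in doc.noun_chunks]
    return tokens, noun_chunks

print(featurize("A method for charging the battery of an electric vehicle."))
```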
15. Tech stack
● Developed on Spark MLlib (or use gensim if the dataset is smaller)
● Has to handle millions of documents
● We use a cluster with 300 GB RAM and 50 CPU cores
● S3 to persist the data
● Pre- and post-processing pipelines
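A minimal local sketch of training LDA with Spark MLlib; the toy data and parameter values are illustrative, not the production setup:

```python
# Sketch: count-vectorize tokens and fit Spark MLlib's LDA locally.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-demo").getOrCreate()
df = spark.createDataFrame(
    [(["battery", "charging", "circuit"],),
     (["neural", "network", "training"],)],
    ["tokens"])

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10000)
vectors = cv.fit(df).transform(df)

lda = LDA(k=2, maxIter=20, optimizer="online")
model = lda.fit(vectors)
model.describeTopics(5).show()   # top 5 terms per topic
```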
16. Hyperparameters
● Doc concentration prior (alpha)
● Topic concentration prior (beta)
● Number of topics (K)
● Iterations
● Vocab size / feature size (N), in bag-of-words format
● max-df tuning
● Custom stopwords to further prune noisy vocabulary
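These map onto Spark MLlib roughly as below; a sketch with illustrative values (maxDF on CountVectorizer assumes a reasonably recent Spark version):

```python
# Sketch: wiring the hyperparameters above into Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()

cv = CountVectorizer(inputCol="tokens", outputCol="features",
                     vocabSize=50000,       # N: BOW feature size
                     maxDF=0.5)             # max-df: drop very common terms
lda = LDA(k=100,                            # K: number of topics
          maxIter=100,                      # iterations
          docConcentration=[0.1],           # alpha (illustrative value)
          topicConcentration=0.01)          # beta (illustrative value)
```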
18. Challenges in model evaluation
● LDA is an unsupervised model; how do we cross-check convergence?
● Test-set validation?
● What measure do we use for grid search?
● How do we compare two LDA runs?
● We want to avoid the human bias involved in comparing topics.
19. Model Evaluation Methods
● Perplexity: ensure the log-likelihood function is at its maximum, which drives perplexity down.
● Plot the sum of the probabilities of the top 10 words from the Topic x Vocab matrix (see the sketch after this list).
● Topic coherence evaluation
● Topic dependency score
● Manual evaluation framework
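A quick sketch of the top-10 probability-mass check, using a random toy Topic x Vocab matrix in place of a trained model; a well-converged topic concentrates more mass in its head:

```python
# Sketch: per-topic sum of the 10 highest word probabilities.
import numpy as np

rng = np.random.default_rng(0)
topic_vocab = rng.dirichlet([0.01] * 1000, size=20)   # 20 topics x 1000 words

top10_mass = np.sort(topic_vocab, axis=1)[:, -10:].sum(axis=1)
for k, mass in enumerate(top10_mass):
    print(f"topic {k}: top-10 mass = {mass:.2f}")     # plot these per run
```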
20. Perplexity
● A measure of whether a probabilistic model's likelihood function has reached its maximum.
● Applied on a held-out or test dataset.
● Used to tune one parameter while keeping the others constant, similar to elbow-point identification in K-Means.
● Perplexity doesn't measure contextual information between words; it works at the per-word level.
● So it's not directly usable as the final model evaluation metric, but we can use it to tune the model's hyperparameters.
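A sketch of held-out perplexity with gensim on placeholder token lists; gensim's log_perplexity returns a per-word log2 bound, so perplexity is 2 to the negative bound:

```python
# Sketch: perplexity on a held-out corpus with gensim.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_texts = [["battery", "charging"], ["network", "training"]]
heldout_texts = [["battery", "circuit"]]

dictionary = Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = LdaModel(train_corpus, id2word=dictionary, num_topics=2, passes=5)
bound = lda.log_perplexity(heldout_corpus)   # per-word log2 likelihood bound
print("perplexity:", np.exp2(-bound))        # lower is better
```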
23. Coherence Scores
● The method that matches manual verification most closely.
● Weighs whether the words' co-occurrences are really present in the documents.
● We can control the context window: full document, paragraph, or sentence level.
● A custom sliding window can also be applied.
● The gensim library provides off-the-shelf implementations of the standard coherence scores.
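A self-contained sketch with gensim's CoherenceModel on toy texts; swap in your trained model and corpus:

```python
# Sketch: compute a sliding-window coherence score with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["battery", "charging", "circuit"],
         ["neural", "network", "training"],
         ["battery", "circuit", "design"],
         ["training", "deep", "network"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=5)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence="c_v")     # or "u_mass", "c_npmi", ...
print("c_v coherence:", cm.get_coherence())
```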
26. Coherence scores are used to compare models - UMass
● LDA Run 1: -5.403614
● LDA Run 2: -2.780710
● LDA Run 3: -3.300038
● The higher (less negative) the score, the better; here Run 2 is the best.
27. Topic dependency - Jaccard Distance
● Find how close or distant the topics are.
● Helpful for knowing whether your topics are highly overlapping or specific in nature.
● It's very easy to calculate using the top N words from each topic-vocab distribution.
● The median overlap score can be used as the optimisation target for grid search.
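A sketch of the pairwise Jaccard computation; the top-word sets are made-up examples:

```python
# Sketch: pairwise Jaccard distance between topics' top-N word sets.
from itertools import combinations
import statistics

topics = [{"battery", "charge", "circuit", "power"},
          {"battery", "cell", "lithium", "charge"},
          {"network", "layer", "training", "model"}]

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

distances = [jaccard_distance(a, b) for a, b in combinations(topics, 2)]
print("median distance:", statistics.median(distances))  # grid-search target
```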
28. Grid search for the best parameters
● Make use of LDADE.
● A differential-evolution method to optimise any black-box function.
● Best fit if you are training on a small dataset, since finding a good parameter set takes hundreds of training runs; otherwise you need a big cluster to reduce the training time.
● LDADE reduces the overall search space, but the number of runs is still not very low.
● A rule of thumb: if your model trains within a few minutes, it's ideal.
● Topic variance between two runs is used as the loss function.
● Reference: https://labs.imaginea.com/reference/lda-tuning/
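A sketch of the idea with scipy's differential_evolution; train_and_score is a hypothetical stand-in for "train LDA, return a loss":

```python
# Sketch: treat "train LDA, score it" as a black-box function of
# (k, alpha, beta) and optimise it with differential evolution.
from scipy.optimize import differential_evolution

def train_and_score(params):
    k, alpha, beta = int(params[0]), params[1], params[2]
    # ... train LDA with (k, alpha, beta), return a loss such as
    # negative coherence or topic variance between two runs ...
    return (k - 50) ** 2 + alpha + beta   # dummy loss for illustration

bounds = [(10, 200), (0.01, 1.0), (0.01, 1.0)]   # k, alpha, beta
result = differential_evolution(train_and_score, bounds, maxiter=10, seed=0)
print(result.x)   # best (k, alpha, beta) found
```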
29. Summary
● LDA has been used to find latent topics in documents.
● LDA converges well and accumulates good words for each topic, describing it well.
● It can be used for feature extraction from a document.
● Model evaluation is the difficult part; use coherence scores along with other measures.