SlideShare a Scribd company logo
1 of 31
Download to read offline
Document clustering using LDA
Haridas N <haridas.n@imaginea.com>
@haridas_n
Agenda
● Introduction to LDA
● Other Clustering Methods
● Model pipeline and Training
● Evaluate LDA model results
○ How to measure the quality of results
○ Evaluate the coherence of the topics
○ Cross check the patents in the cluster are similar
LDA: Find natural categories of
millions of documents, and
suggest a name for each
category.
LDA - Latent Dirichlet Allocation
● Generative probabilistic model, which generates documents from topics and
topics from vocabs.
● An Unsupervised Model
● Other clustering algorithms are LSI, PLSI and K-Mean
Clustering Models
LSI
● Dimensionality reduction method using Truncated SVD.
● Document D = N x V
● SVD applied on D = N x T and T x V
● It lacks the interpretability of the topics.
● And representation quality isn’t that good.
PLSI
● Extension to the LSI by making it probabilistic model
LDA Model
● Plate notation of LDA
Probabilistic graphical model.
● Uses Bayesian inference to find
best likelihood estimation.
● Uses Dirichlet priors for Topic
and Vocabs, hence the name LDA
● Alpha and Beta are Dirichlet
priors
● K topics
● N vocabs
● M documents
K-mean clustering
● Kmean applied on top of the Document x Topic dataset.
● After the patents are rearranged based on spatial location, we can assign the topic
number based on existing patents in it.
● LDA is acting as a Dimensionality reduction of sparse Document x Vocab dataset
into Document x Topic matrix which is dense.
● Kmean does good job on dense vectors.
Feature Extraction
Feature Engineering
● Tokenization and text cleanups
● Apply standard and custom stopword filtering
● Noun-chunk extraction using spacy or nltk based taggers.
● N-gram features
○ If lot of data available then unigrams itself gives pretty good result.
● Stemming / Lemmatization
● TF-IDF based feature selection
Model Pipeline
Documents
Tokenize
D x V
Pre
Processing
BOW
(D x V)
LDA
D x T &
T x V
Training
Tech stack
● Developed on spark mllib ( Or you can use gensim if dataset is smaller )
● Have to handle millions of documents
● We use cluster size of 300GB RAM and 50Core CPU.
● S3 to persist the data
● Pre and post processing pipelines
Hyper parameters
● Doc-Concentration prior ( Alpha )
● Topic Concentration prior ( Beta )
● Number of topics ( K )
● Iterations
● Vocab Size or Feature size ( N ) - in BOW format.
● Max-df tuning
● Custom stopwords to further prune noisy vocabs.
Model Evaluation
Challenges on model evaluation
● LDA is an Unsupervised model, how do we cross check the convergence ?
● Test set validation ?
● What measure we use for grid search ?
● How we compare two LDA runs ?
● We want to avoid human bias involved when comparing the topics
Model Evaluation Methods
● Perplexity - Ensure log likelihood function is maximum point, which will bring
perplexity to lower side.
● Plot the sum of probabilities of top 10 vocabs from Topic x Vocab matrix.
● Topic Coherence valuation
● Topic Dependency score
● Manual evaluation framework.
Perplexity
● A measure to know probabilistic models’ likelihood function reached at maximum
point.
● Applied on held-out dataset or test dataset.
● This measure has been used to tune a particular parameter keeping others
constant - similar to Elbow point identification on Kmean.
● Perplexity doesn’t measure the contextual information between words, it’s rather
per word level.
● So it’s not directly usable as final model evaluation metric. We can use it to tune
the hyper parameters of the model.
Probability sum of top 10 vocabs from T x V matrix
Wordcloud based on the word weightage for a topic
Coherence Scores
● Best method which matches close to the manual verification.
● Gives importance to the co-occurrence of the words really there on the document
or not.
● We can control the context window, full document based, paragraph or Sentence
wise.
● Custom sliding window also we can apply.
● Gensim library provides off-the self implementation for standard coherence
scores.
Different Coherence methods
● Umass - Boolean document estimation
● UCI - Sliding window based document estimation
Different Coherence methods
● NPMI - Sliding window based co-occurrence counting.
● Etc..
● Java Implementation - https://github.com/dice-group/Palmetto
● Reference:- https://labs.imaginea.com/post/how-to-measure-topic-coherence/
Coherence scores are used to compare models - Umass
● LDA Run 1 - -5.403614
● LDA Run 2 - -2.780710
● LDA Run 3 - -3.300038
● Higher the score better, these scores better
Topic dependency - Jaccard Distance
● Find how close or distant the topics are
● Helpful to know whether your topics are very dependent or specific in nature
● It’s very easy to calculate, using the top N words from each topic-vocab
distribution.
● Overlap median score can be used as optimisation parameter for grid-search.
Grid search for best parameters
● Make use of the LDADE.
● Differential evolution methods to optimise any black box function
● Best fit if you are training on a small data-size, as you need to do hundreds of
model training to find good param set. Or you need big cluster to reduce the
training time.
● LDADE reduce the overall search space, but still it’s not very low in number
● Rule of thumb you can apply is, if you model trains with in few mins it’s ideal.
● Topic variance between two runs are considered as a loss function.
● Reference: https://labs.imaginea.com/reference/lda-tuning/
Summary
● LDA has been used to find latent topics from documents
● LDA converges well enough and accumulates good words for each topic to
describe it well.
● Can be usable as feature extraction from a document
● Model evaluation is a difficult part, Use coherence scores along with other
measures.
QA
Thank you
Haridas N <hn@haridas.in>

More Related Content

What's hot

Canonical Formatted Address Data
Canonical Formatted Address DataCanonical Formatted Address Data
Canonical Formatted Address Datadanielschulz2005
 
Canonical Formatted Address Data
Canonical Formatted Address DataCanonical Formatted Address Data
Canonical Formatted Address Datadanielschulz2005
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...HPCC Systems
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandrySri Ambati
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 
MachineLearningMSConference
MachineLearningMSConferenceMachineLearningMSConference
MachineLearningMSConferenceGeorge Simov
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET Journal
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...KozoChikai
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 

What's hot (20)

Canonical Formatted Address Data
Canonical Formatted Address DataCanonical Formatted Address Data
Canonical Formatted Address Data
 
Canonical Formatted Address Data
Canonical Formatted Address DataCanonical Formatted Address Data
Canonical Formatted Address Data
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
Language models
Language modelsLanguage models
Language models
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...
HPCC Systems Engineering Summit Presentation - Collaborative Research with FA...
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark Landry
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
MachineLearningMSConference
MachineLearningMSConferenceMachineLearningMSConference
MachineLearningMSConference
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 

Similar to Document Clustering using LDA | Haridas Narayanaswamy [Pramati]

Software Craftmanship - Cours Polytech
Software Craftmanship - Cours PolytechSoftware Craftmanship - Cours Polytech
Software Craftmanship - Cours Polytechyannick grenzinger
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents Sharvil Katariya
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysisodsc
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Analytics Boot Camp - Slides
Analytics Boot Camp - SlidesAnalytics Boot Camp - Slides
Analytics Boot Camp - SlidesAditya Joshi
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet SentimentLucinda Linde
 
ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPindico data
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...Dat Nguyen
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Lviv Startup Club
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Sagemaker built_in algorithems.pptx
Sagemaker built_in algorithems.pptxSagemaker built_in algorithems.pptx
Sagemaker built_in algorithems.pptxasifshahzad100
 
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
 

Similar to Document Clustering using LDA | Haridas Narayanaswamy [Pramati] (20)

Software Craftmanship - Cours Polytech
Software Craftmanship - Cours PolytechSoftware Craftmanship - Cours Polytech
Software Craftmanship - Cours Polytech
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Analytics Boot Camp - Slides
Analytics Boot Camp - SlidesAnalytics Boot Camp - Slides
Analytics Boot Camp - Slides
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet Sentiment
 
ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLP
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...
Vjai paper reading201808-acl18-simple-and_effective multi-paragraph reading c...
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Sagemaker built_in algorithems.pptx
Sagemaker built_in algorithems.pptxSagemaker built_in algorithems.pptx
Sagemaker built_in algorithems.pptx
 
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
 

More from Pramati Technologies

Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Pramati Technologies
 
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati Technologies
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati TechnologiesClojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati Technologies
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati TechnologiesPramati Technologies
 
Swift UI - Declarative Programming [Pramati Technologies]
Swift UI - Declarative Programming [Pramati Technologies]Swift UI - Declarative Programming [Pramati Technologies]
Swift UI - Declarative Programming [Pramati Technologies]Pramati Technologies
 
Adaptive Cards - Pramati Technologies
Adaptive Cards - Pramati TechnologiesAdaptive Cards - Pramati Technologies
Adaptive Cards - Pramati TechnologiesPramati Technologies
 
VitaFlow | Mageswaran Dhandapani [Pramati]
VitaFlow | Mageswaran Dhandapani [Pramati]VitaFlow | Mageswaran Dhandapani [Pramati]
VitaFlow | Mageswaran Dhandapani [Pramati]Pramati Technologies
 
Typography Style Transfer using GANs | Pramati
Typography Style Transfer using GANs | Pramati Typography Style Transfer using GANs | Pramati
Typography Style Transfer using GANs | Pramati Pramati Technologies
 
Pramati - Chennai Development Center
Pramati - Chennai Development CenterPramati - Chennai Development Center
Pramati - Chennai Development CenterPramati Technologies
 

More from Pramati Technologies (7)

Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]
 
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati Technologies
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati TechnologiesClojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati Technologies
Clojure through the eyes of a Java Nut | [Mixed Nuts] at Pramati Technologies
 
Swift UI - Declarative Programming [Pramati Technologies]
Swift UI - Declarative Programming [Pramati Technologies]Swift UI - Declarative Programming [Pramati Technologies]
Swift UI - Declarative Programming [Pramati Technologies]
 
Adaptive Cards - Pramati Technologies
Adaptive Cards - Pramati TechnologiesAdaptive Cards - Pramati Technologies
Adaptive Cards - Pramati Technologies
 
VitaFlow | Mageswaran Dhandapani [Pramati]
VitaFlow | Mageswaran Dhandapani [Pramati]VitaFlow | Mageswaran Dhandapani [Pramati]
VitaFlow | Mageswaran Dhandapani [Pramati]
 
Typography Style Transfer using GANs | Pramati
Typography Style Transfer using GANs | Pramati Typography Style Transfer using GANs | Pramati
Typography Style Transfer using GANs | Pramati
 
Pramati - Chennai Development Center
Pramati - Chennai Development CenterPramati - Chennai Development Center
Pramati - Chennai Development Center
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Document Clustering using LDA | Haridas Narayanaswamy [Pramati]

  • 1. Document clustering using LDA Haridas N <haridas.n@imaginea.com> @haridas_n
  • 2. Agenda ● Introduction to LDA ● Other Clustering Methods ● Model pipeline and Training ● Evaluate LDA model results ○ How to measure the quality of results ○ Evaluate the coherence of the topics ○ Cross check the patents in the cluster are similar
  • 3. LDA: Find natural categories of millions of documents, and suggest a name for each category.
  • 4.
  • 5. LDA - Latent Dirichlet Allocation ● Generative probabilistic model, which generates documents from topics and topics from vocabs. ● An Unsupervised Model ● Other clustering algorithms are LSI, PLSI and K-Mean
  • 7. LSI ● Dimensionality reduction method using Truncated SVD. ● Document D = N x V ● SVD applied on D = N x T and T x V ● It lacks the interpretability of the topics. ● And representation quality isn’t that good.
  • 8. PLSI ● Extension to the LSI by making it probabilistic model
  • 9. LDA Model ● Plate notation of LDA Probabilistic graphical model. ● Uses Bayesian inference to find best likelihood estimation. ● Uses Dirichlet priors for Topic and Vocabs, hence the name LDA ● Alpha and Beta are Dirichlet priors ● K topics ● N vocabs ● M documents
  • 10. K-mean clustering ● Kmean applied on top of the Document x Topic dataset. ● After the patents are rearranged based on spatial location, we can assign the topic number based on existing patents in it. ● LDA is acting as a Dimensionality reduction of sparse Document x Vocab dataset into Document x Topic matrix which is dense. ● Kmean does good job on dense vectors.
  • 12. Feature Engineering ● Tokenization and text cleanups ● Apply standard and custom stopword filtering ● Noun-chunk extraction using spacy or nltk based taggers. ● N-gram features ○ If lot of data available then unigrams itself gives pretty good result. ● Stemming / Lemmatization ● TF-IDF based feature selection
  • 13. Model Pipeline Documents Tokenize D x V Pre Processing BOW (D x V) LDA D x T & T x V
  • 15. Tech stack ● Developed on spark mllib ( Or you can use gensim if dataset is smaller ) ● Have to handle millions of documents ● We use cluster size of 300GB RAM and 50Core CPU. ● S3 to persist the data ● Pre and post processing pipelines
  • 16. Hyper parameters ● Doc-Concentration prior ( Alpha ) ● Topic Concentration prior ( Beta ) ● Number of topics ( K ) ● Iterations ● Vocab Size or Feature size ( N ) - in BOW format. ● Max-df tuning ● Custom stopwords to further prune noisy vocabs.
  • 18. Challenges on model evaluation ● LDA is an Unsupervised model, how do we cross check the convergence ? ● Test set validation ? ● What measure we use for grid search ? ● How we compare two LDA runs ? ● We want to avoid human bias involved when comparing the topics
  • 19. Model Evaluation Methods ● Perplexity - Ensure log likelihood function is maximum point, which will bring perplexity to lower side. ● Plot the sum of probabilities of top 10 vocabs from Topic x Vocab matrix. ● Topic Coherence valuation ● Topic Dependency score ● Manual evaluation framework.
  • 20. Perplexity ● A measure to know probabilistic models’ likelihood function reached at maximum point. ● Applied on held-out dataset or test dataset. ● This measure has been used to tune a particular parameter keeping others constant - similar to Elbow point identification on Kmean. ● Perplexity doesn’t measure the contextual information between words, it’s rather per word level. ● So it’s not directly usable as final model evaluation metric. We can use it to tune the hyper parameters of the model.
  • 21. Probability sum of top 10 vocabs from T x V matrix
  • 22. Wordcloud based on the word weightage for a topic
  • 23. Coherence Scores ● Best method which matches close to the manual verification. ● Gives importance to the co-occurrence of the words really there on the document or not. ● We can control the context window, full document based, paragraph or Sentence wise. ● Custom sliding window also we can apply. ● Gensim library provides off-the self implementation for standard coherence scores.
  • 24. Different Coherence methods ● Umass - Boolean document estimation ● UCI - Sliding window based document estimation
  • 25. Different Coherence methods ● NPMI - Sliding window based co-occurrence counting. ● Etc.. ● Java Implementation - https://github.com/dice-group/Palmetto ● Reference:- https://labs.imaginea.com/post/how-to-measure-topic-coherence/
  • 26. Coherence scores are used to compare models - Umass ● LDA Run 1 - -5.403614 ● LDA Run 2 - -2.780710 ● LDA Run 3 - -3.300038 ● Higher the score better, these scores better
  • 27. Topic dependency - Jaccard Distance ● Find how close or distant the topics are ● Helpful to know whether your topics are very dependent or specific in nature ● It’s very easy to calculate, using the top N words from each topic-vocab distribution. ● Overlap median score can be used as optimisation parameter for grid-search.
  • 28. Grid search for best parameters ● Make use of the LDADE. ● Differential evolution methods to optimise any black box function ● Best fit if you are training on a small data-size, as you need to do hundreds of model training to find good param set. Or you need big cluster to reduce the training time. ● LDADE reduce the overall search space, but still it’s not very low in number ● Rule of thumb you can apply is, if you model trains with in few mins it’s ideal. ● Topic variance between two runs are considered as a loss function. ● Reference: https://labs.imaginea.com/reference/lda-tuning/
  • 29. Summary ● LDA has been used to find latent topics from documents ● LDA converges well enough and accumulates good words for each topic to describe it well. ● Can be usable as feature extraction from a document ● Model evaluation is a difficult part, Use coherence scores along with other measures.
  • 30. QA
  • 31. Thank you Haridas N <hn@haridas.in>