topic modelling for humans




William Bert
DC Python Meetup
1 May 2012
please go to http://ADDRESS
and enter a sentence



interesting relationships?

gensim generated the data for those visualizations
by computing the semantic similarity of the input
who am I?



William Bert
developer at Carney Labs (teamcarney.com)
user of gensim
still new to the world of topic modelling,
semantic similarity, etc.
gensim: “topic modelling for humans”

topic modelling attempts to uncover the
underlying semantic structure of text (or
other data) by identifying recurring patterns
of terms in a set of data (topics).

topic modelling
does not parse sentences,
does not care about word order, and
does not "understand" grammar or syntax.
gensim: “topic modelling for humans”
>>> lsi_model.show_topics()
'-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',

'0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',

'0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
gensim isn't about topic modeling
(for me, anyway)
It's about similarity.
What is similarity?
Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)
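To see why the first type is not enough, a minimal sketch using the standard library (the score is illustrative):

from difflib import SequenceMatcher

a = "dogs chase cats"
b = "canines pursue felines"
# character-level string matching sees little in common here...
print SequenceMatcher(None, a, b).ratio()   # low, despite nearly identical meaning
# ...capturing what these sentences share is what semantic similarity is about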
Is
A seven-year quest to collect samples from the
solar system's formation ended in triumph in a
dark and wet Utah desert this weekend.
similar in meaning to
For a month, a huge storm with massive
lightning has been raging on Jupiter under the
watchful eye of an orbiting spacecraft.
more or less than it is similar to
One of Saturn's moons is spewing a giant plume
of water vapour that is feeding the planet's
rings, scientists say.
?
Who cares about semantic similarity?


Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction
gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]

corpus = stream of documents as sparse
vectors of (feature id, count) pairs
for example, words in documents are
features (“bag of words”)
gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]


Dictionary class
>>> print corpus.dictionary
Dictionary(8472 unique tokens)

dictionary maps features (words) to feature ids
(numbers)
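A quick sketch of what that mapping looks like in practice (the ids shown are made up; real ids depend on the corpus):

# token2id maps each word to its integer feature id
print dictionary.token2id["water"]   # e.g. 105

# doc2bow converts a tokenized document into that id space;
# tokens missing from the dictionary are silently dropped
print dictionary.doc2bow("water under the bridge".split())
# e.g. [(105, 1), (871, 1)]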
need a massive collection of documents that
ostensibly has meaning
sounds like a job for wikipedia
>>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.

>>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary

>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be bz2'ed.

>>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
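The persisted dictionary can be revived the same way in a later session; a minimal sketch reusing the filenames from this slide:

from gensim.corpora import Dictionary, MmCorpus

dictionary = Dictionary.load("wiki_dict.dict")   # revive the dictionary
wiki_corpus = MmCorpus("wiki_corpus.mm")         # revive the corpus
print dictionary   # e.g. Dictionary(100000 unique tokens)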
gensim.models


transform corpora using models classes

for example, term frequency/inverse document
frequency (TFIDF) transformation


reflects importance of a term, not just
presence/absence
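As a rough sketch of the weighting idea (gensim's default IDF uses log base 2, and it also L2-normalizes each document vector; the numbers below are illustrative):

import math

def tfidf_weight(term_freq, doc_freq, num_docs):
    # a term scores high when frequent in this document (term_freq)
    # but rare across the corpus (doc_freq out of num_docs documents)
    return term_freq * math.log(float(num_docs) / doc_freq, 2)

print tfidf_weight(2, 100, 3430645)      # rare word: ~30.1
print tfidf_weight(2, 1143548, 3430645)  # common word: ~3.2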
gensim.models
>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)

>>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]

>>> MmCorpus.serialize("tfidf_corpus.mm", tfidf_trans[wiki_corpus])  # builds a new corpus by iterating over documents transformed to TFIDF ("tfidf_corpus.mm" is an illustrative filename)
>>> tfidf_corpus = MmCorpus("tfidf_corpus.mm")
gensim.models


>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400)  # creates LSI transformation model from tfidf corpus representation
topics again for a bit
>>> lsi_model.show_topics()
'-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',

'0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',

'0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full-rank SVD would be able to recreate the underlying
matrix exactly from those three matrices
• lower-rank SVD provides the best (least-squares error)
approximation of the matrix
• this approximation can find interesting relationships among
data
• it preserves most information while reducing noise and
merging dimensions associated with terms that have similar
meanings (see the numpy sketch below)
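The numpy sketch below illustrates only the rank-reduction idea; gensim itself uses an incremental, out-of-core SVD rather than numpy's batch version:

import numpy as np

# toy term-document matrix: rows are terms, columns are documents
A = np.random.rand(1000, 200)

# full SVD decomposes A into three simpler matrices: A = U * S * V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keeping only the k largest singular values yields the best rank-k
# (least-squares) approximation; k plays the role of the topic count
k = 10
A_k = np.dot(U[:, :k] * s[:k], Vt[:k, :])

print np.linalg.norm(A - A_k)   # reconstruction error shrinks as k grows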
topics again for a bit

• SVD:
alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

• Original paper:
www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf

• General explanation:
tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf

• Many more
gensim.models


>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400, decay=1.0, chunksize=20000)  # creates LSI transformation model from tfidf corpus representation

>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
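Models persist with the same save/load pattern as corpora and dictionaries; a brief sketch (the filename is illustrative):

>>> lsi_trans.save("wiki_lsi.model")   # persist the trained model
>>> lsi_trans = models.LsiModel.load("wiki_lsi.model")   # revive it later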
gensim.similarities
(the best part)
>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")

>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]  # similarity of each document in the index corpus to a new query document

>>> [s for s in index]  # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]),
 array([ 0.  ,  1.  ,  0.02, -0.02]),
 array([ 0.08,  0.02,  1.  ,  0.15]),
 array([ 0.01, -0.02,  0.15,  1.  ])]
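The tokenize() above is whatever preprocessing produced the index corpus. One plausible stand-in, assuming the corpus was built with gensim's default word tokenization:

from gensim.utils import simple_preprocess

def tokenize(text):
    # lowercase and split into word tokens, mirroring (approximately)
    # how the corpus documents were preprocessed
    return simple_preprocess(text)

query = "One of Saturn's moons is spewing a giant plume of water vapour"
print index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]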
about gensim
four additional models available

dependencies: numpy, scipy
optional: Pyro, Pattern

created by Radim Rehurek
•radimrehurek.com/gensim
•github.com/piskvorky/gensim
•groups.google.com/group/gensim
thank you


example code, visualization code, and ppt:
github.com/sandinmyjoints

interview with Radim:
williamjohnbert.com
(additional slides)
gensim.models

• term frequency/inverse document frequency
(TFIDF)
• log entropy
• random projections
• latent dirichlet allocation (LDA), sketched below
• hierarchical dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)
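Each of these drops in where TfidfModel and LsiModel appeared earlier. For example, a sketch of the kind of LDA model behind the final slide (num_topics is illustrative):

>>> lda_model = models.LdaModel(corpus=wiki_corpus, id2word=dictionary, num_topics=100)
>>> lda_model.show_topics()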
slightly more about gensim

Dependencies: numpy and scipy, and optionally
Pyro for distributed computation and Pattern for lemmatization

data from Lee 2005 and other papers is available
in gensim for tests
gensim: “topic modelling for humans”

>>> lda_model.show_topics()
['0.083*bridge + 0.034*dam + 0.034*river +
0.027*canal + 0.026*construction + 0.014*ferry +
0.013*bridges + 0.013*tunnel + 0.012*trail +
0.012*reservoir',
 '0.044*fight + 0.029*bout + 0.029*via +
0.028*martial + 0.025*boxing + 0.024*submission +
0.021*loss + 0.021*mixed + 0.020*arts +
0.020*fighting',
 '0.086*italian + 0.062*italy + 0.048*di +
0.024*milan + 0.019*rome + 0.014*venice +
0.013*giovanni + 0.012*della + 0.011*florence +
0.011*francesco']


Editor's Notes

  1. Hi everyone, thanks for coming. I'm going to start off with a quick demo app. Please go to the address you see up there. Hookbox sometimes takes several seconds to connect to the channel, so give it a moment if it's red; it will turn green. You'll be invited to submit a sentence, particularly a statement or fact that has a mixture of nouns, verbs, and adjectives. There are some examples of the kinds of sentence that might work well with this demo, but take a moment to think of a sentence of your own and go ahead and submit. We should see them pop up here on the visualization screen. The idea is to provide a bit of concrete grounding for the talk, and this will serve as an example of one thing you can do with gensim, or at least the data generated by it. What do we see? We have a table comparing a number of submitted sentences with what I'm going to call similarity scores between them. The darker the green, the higher the score. Are there any interesting results? Here we also have some clustering visualizations that attempt to group together the inputs that were found to have the highest scores. How do they cluster together? [click] Hopefully that worked and showed some interesting relationships among the input. (If not, well, I blame the input.) gensim, which I'll be talking about today, was generating all the underlying similarity scores, measuring how similar each sentence was to the other ones. I'm going to explain how to get results like this from gensim.
  2. A few quick words about me: William Bert, developer at Carney Labs (teamcarney.com) for about seven months. Carney Labs is basically a startup wholly owned by a larger company in Alexandria called Team Carney. I use gensim at work developing a conversational tutoring web app. Topic modelling is still pretty new to me and I'm constantly learning more about it, so my knowledge is still growing, but I'm really fascinated by it and trying to learn more by working with it a lot, and by doing things like this presentation.
  3. gensim is a free Python framework for doing topic modelling. I'm going to blaze through a quick overview of topic modelling, then discuss how gensim uses it to do semantic similarity, generating data like what we saw. Topic modelling attempts to uncover the underlying semantic structure of text (or other data) by using statistical techniques to identify abstract, recurring patterns of terms in a set of data. These patterns are called topics. They may or may not correspond to our intuitive notion of a topic. Topic modelling models documents as collections of features, representing the documents as long vectors that indicate the presence/absence of important features, for example, the presence or absence of words in a document. We can use those vectors to create spaces, plot the locations of documents in those spaces, and use that as a kind of proxy for their meaning. What isn't topic modelling? Topic modelling does not parse sentences; in fact, it knows nothing about word order. It makes no attempt to "understand" grammar or language syntax. What does a topic look like?
  4. Let's take a quick look at some topics now, and we'll come back to them again after I walk through how to generate them. [click] These three abbreviated topics were extracted from a large corpus of texts by gensim using a technique called latent semantic analysis (LSA). Just quickly note how they are collections of words that don't necessarily or intuitively seem to belong together. There are also positive and negative scalar factors for each word, which get smaller in magnitude as the topic goes on. We don't see it here, but each of these topics actually has thousands more terms; these are just the first ten. So that's what a topic looks like when I talk about topics, but the truth is...
  5. gensim isn't really about topic modeling, for me anyway. [click] It's really about similarity. Topics are a means to an end. [click] A few words about similarity, because it can be elusive. [click] There are different kinds of similarity:
- String matching: how many characters strings have in common.
- Stylometry: similarity of style that looks at, say, length of words or sentences, use of function words, ratio of nouns to verbs, etc. Used to identify authors, for example.
- Term frequency: do the documents use the same words the same number of times (when scaled and normalized)?
The kind of similarity I'm interested in is semantic similarity: similarity of meanings. But what is that?
  6. Take a moment to read these three sentences. They might be said to share certain elements: non-earth planets, weather, duration, research and data collection. How would you quantify their similarity? How would you decide that two are more similar to each other than to the third? A study done in Australia in 2005 skipped over the question of defining semantic similarity formally and abstractly, and instead defined it as what a sample of Australian college students think is similar. They had students read hundreds of paired short excerpts from news articles (these sentences are excerpted from some of those) and rank the pairwise similarity on a scale. They then examined all the classifications and found that they had a correlation of 0.6. That's obviously a positive correlation, but not terribly high. So humans don't necessarily agree with each other about semantic similarity; it's kind of a fuzzy notion. That said, we're going to try to put a number on it. In fact, the study I mentioned found that a particular topic modelling technique called latent semantic analysis (LSA) could also achieve a 0.6 correlation with the human ratings, correlating with the study participants' choices about as well as they correlated with each other.
  7. Why do we care about semantic similarity? Some use cases for document similarity comparison:
- Traditionally used on large document collections: legal discovery; answering questions or aiding search over huge corpora like government regulations, manuals, patent databases, etc.
- Automatic metadata: a system can intelligently suggest tags and categories for documents based on other documents they're similar to.
- Something that came up recently on the gensim Google group: in a CMS, when a user creates a new post, they want to see posts that may be similar in content. We can do that with semantic similarity, and in fact someone actually made this into a plugin for Plone using gensim.
- Recommendations, plagiarism detection, exam scoring.
And there are a number of other use cases. There are some new and fast online algorithms that work in realtime, whereas previously work was often done in batches. This brings up another potential use: better HCI. Matching on similarity rather than on words or with regexes allows us to accept broader ranges of input, in theory. So to make it happen, enter... gensim.
  8. To get our topics that we can then use to compute semantic similarity, we're going to start by turning a large set of documents, which we'll call a training/background corpus, into numeric vectors. We'll use the tools in gensim's corpora package. When I say document, a document can be as short as one word, or as long as many pages of text, or anywhere in between. My examples and the demo app are mostly sentence-size documents. In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.) A corpus can be made from a file, database query, network stream, etc., as long as you can stream the documents and emit vectors. gensim iterates over the documents in a corpus with generators, so it uses constant memory, which means you can have enormous corpora and indexes. How you generate those vectors from the documents is up to you. (So a corpus isn't inherently tied to words or even language; it could be constructed from features of anything, such as music or video, if you can figure out what to use as features [like amplitude or frequencies?] and how to extract them.) If your features are the presence or absence of words, your corpus is in what's called "bag of words" (BOW) format. gensim provides a convenience class called TextCorpus for creating such a corpus from a text file. So here we have a list of documents; each document is a list of (feature id, count) tuples. Feature #40 appears one time in document #0, etc. For BOW, we also need a dictionary...
  9. The dictionary maps feature ids back to features (words). The corpus class will generate this for me. So the vectors indicate the presence of words in particular documents, and the resulting matrix containing these vectors will represent all the words appearing in all the documents.
  10. To do interesting and useful things with semantic similarity, we need a good training or background corpus. Finding a good training corpus is something of an art. You want a large collection of documents (at least tens of thousands) that are representative of your problem domain. It can be difficult to find or build such a corpus. [click] Or, you can just use Wikipedia. Helpfully for experimenting, gensim comes with a WikiCorpus class and other code for building a corpus from a Wikipedia article dump. [click] WikiCorpus makes two passes: one to extract the dictionary, and another to create and store the sparse vectors. It takes about 10 hours on an i7 to generate and serialize the corpus and dictionary, though it uses constant memory. The resulting output vectors are about 15 GB uncompressed, about 5 compressed. So after these operations, wiki_corpus is now a BOW vector space representation of Wikipedia, embodied in a large corpus file in the Matrix Market format (a popular matrix file format) and a several-megabyte dictionary mapping ids to tokens (words). What can we do with our corpus?
  11. We can transform corpora from one vector space to another using models. Transformations can bring out hidden structure in the corpus, such as revealing relationships between words and documents. They can also represent the corpus in a more compact way, preserving much information while consuming fewer resources. A gensim "transformation" is any object which accepts a sparse document via dictionary notation and returns another sparse document. One useful transformation that we can generate from our BOW corpus is term frequency/inverse document frequency (TFIDF). Instead of a count of word appearances in a document, we get a score for each word that also takes into account the global frequency of that word. So a word's TFIDF value in a given document increases proportionally to the number of times the word appears in that particular document, but is offset by the frequency of the word in the entire corpus, which helps to control for the fact that some words are generally more common than others.
  12. Transformations are initialized with a training corpus, so we realize a TFIDF transformation from the corpus we just generated, wiki_corpus. This also takes several hours to generate for Wikipedia. num_docs is the number of documents in the training corpus. num_nnz is the number of non-zeroes in the matrix. [click] Once our model is generated, we can transform documents represented in one vector space model (the wiki_corpus BOW space) and emit them in another (the wiki_corpus TFIDF space) as (word_id, word_weight) tuples, where weight is a positive, normalized float. These documents can be anything, new and unseen, as long as they have been tokenized and put into the BOW representation using the same tokenizer and dictionary word->id mappings that were used for the wiki corpus. [click] We can emit these new representations right into a fresh MmCorpus, which could also be serialized and persisted on disk (also requiring several GBs). However,
  13. the TFIDF corpus itself is not all that interesting except as a stepping stone to another model called LSI. Latent semantic indexing/analysis (LSI/LSA) is the granddaddy of topic modelling similarity techniques; the original paper is from 1990. It produced the results we saw in the visualization. We can generate an LSI model almost the same way we did the TFIDF model, but we do need to provide an extra parameter, num_topics, which brings us back to topics...
  14. num_topics is a parameter to LSI telling it how many topics to make. Here again are the topics we saw. These were generated by LSI from Wikipedia articles. What are these topics? It's hard to say, exactly. The "themes" are unclear. But they are in some sense the corpus's "principal components". (And in fact, principal component analysis is similar, if you know what that is.) Here's a brief rundown of how LSI works to calculate these topics...
  15. LSI uses a technique called singular value decomposition (SVD) to reduce the original term/document matrix's number of dimensions and keep the most information for a given number of topics. I understand the technique conceptually, but I'm not going to try to get into the math behind it because I don't really understand it well enough to explain it, and there are plenty of resources online that explain it in great and accurate detail. Nonetheless, I'll at least describe it briefly: SVD decomposes the word/document matrix into three simpler matrices. Full-rank SVD will recreate the underlying matrix exactly, but LSA uses lower-order SVD, which provides the best (in the sense of least-squares error) approximation of the matrix at lower dimensions. By lowering the rank, dimensions associated with terms that have similar meanings are merged together. This preserves the most important semantic information in the text while reducing noise, and can uncover interesting relationships among the data of the underlying matrix. Still, the meaning of the terms and topics is not really apparent to us. This is because a single LSI topic is not about a single thing; the topics work as a set. The topics contain both positive and negative values, which cancel each other out delicately when generating vectors for documents. This is one of the reasons LSI topics are hard to interpret.

The original matrix can be too large for the computing resources; in this case, the approximated low-rank matrix is interpreted as an approximation (a "least and necessary evil"). The original matrix can be noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original). The original term-document matrix is also presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document, generally a much larger set due to synonymy.

The consequence of the rank lowering is that some dimensions are combined and depend on more than one term: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}. This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out or, at worst, be smaller than components in the directions corresponding to the intended sense.

Formally, for an m*n matrix A: A = U * S * V^T, where U is an m*k matrix, V is an n*k matrix, S is a k*k matrix, and k is the rank of the matrix A.
  16. As I said, there are plenty of resources online to explain it, so for now I will direct your questions there.
  17. So we generate our somewhat mysterious LSI model, asking for 400 topics. Interesting note: no one has really figured out how to determine the best number of topics for a given corpus for LSI, but experimentally people have found good results between 200 and 500. This will also take several hours. LSI model generation can be distributed to multiple CPUs/machines through a library called Python Remote Objects (Pyro), leading to faster model generation times. When it's done, we have an LsiModel with 100,000 terms; that's the size of the dictionary we created. The decay parameter gives more emphasis to new documents if any are added to the model after initial generation. Because the SVD algorithm is incremental, the memory load is constant and can be controlled by a chunksize parameter that says how many documents are to be loaded into RAM at once. Larger chunks speed things up, but also require more RAM.
  18. Now we get to the best part. With our LSI transformation, we can now use the classes in gensim.similarities to create an index of all the documents that we want to compare subsequent queries against. The Similarity class uses fixed memory by splitting the index across shards on disk and mmap'ing them in as necessary. output_prefix is for the filenames of the shards. What is index_corpus? It could be my original training/universe corpus, Wikipedia. Then the index would tell me which Wikipedia document any new queries are most similar to. But the index corpus could also be a set of entirely different documents, for example arbitrary sentences typed in by a group of Python programmers, and the index will determine which of those documents my query is most similar to. You can even add new documents to an index in realtime. [click] So to calculate similarity of a query, we tokenize and preprocess the query the same way we treated the wiki corpus, then convert to BOW, then do the TFIDF transform, then the LSI transform, give that to the index, and we'll get a list of similarity scores between the query and each document in the index. [click] You can also calculate the similarity scores between all documents in the index and get back a 2-dimensional matrix, which is what the visualization app was doing every time a new document was added to the index. This is what makes realtime similarity comparisons possible, for some value of similar.
  19. A few more things about gensim before I wrap up: TFIDF and LSI are only two of six models it implements. There are a couple more weighting models and a couple more dimensionality reduction models, each with different properties, but I haven't had a chance to work with those very much. [click] [click] gensim's dependencies are numpy and scipy, and optionally Pyro for distributed model generation and Pattern for additional input processing. [click] To give credit where credit's due, I want to say a few words about where gensim comes from. It was created by a Czech guy named Radim Rehurek, and his work to make the algorithms scalable and online contributed to his PhD thesis. He is an active developer and is very helpful on the mailing list. He's working hard to build a community around gensim and make it into a robust open source project (LGPL license). I asked him a few questions about his work on gensim; the questions and the answers are up on my personal site, williamjohnbert.com. Radim says: "Gensim has no ambition to become an all-encompassing production level tool, with robust failure handling and error recoveries." But in my experience it has performed well, and Radim mentioned several commercial applications that are using it, in addition to universities.
  20. Thanks for listening. This presentation, some sample code, and the demo app are available on my github page, github.com/sandinmyjoints. I should note that the demo web app and the visualization are actually not part of gensim. gensim generated the data, but the app is Flask and hookbox, the clustering is scipy and scikit-learn, and the visualization is d3. Questions?
  21. In addition to TFIDF, gensim has implemented several VSM algorithms, most of which I know nothing about, but to do justice to gensim's capabilities:
- TFIDF: weights tokens according to importance (local vs global); preserves dimensionality.
- Log entropy: another term weighting function, using log entropy normalization; preserves dimensionality.
- Random projections: approximates TFIDF distances but is less computationally expensive; reduces dimensionality.
- Latent Dirichlet allocation (LDA): a generative model that produces more human-readable topics; reduces dimensionality.
- Hierarchical Dirichlet process (HDP): very new, first described in a paper from 2006, but not all operations are fully implemented in gensim yet.
- Latent semantic indexing/analysis (LSI/LSA): the granddaddy of topic modelling similarity techniques; reduces dimensionality. The original paper is Deerwester et al. 1990. I have used it most.
  22. Dependencies: numpy and scipy, and optionally Pyro for distributed computation and Pattern for lemmatization. Data from Lee 2005 and other papers is available in gensim for tests.
  23. These recurring patterns called topics may or may not correspond to our intuitive notion of a topic. The abbreviated ones printed here were extracted from a large corpus of texts by gensim using a technique called latent Dirichlet allocation (LDA), which actually does tend to produce human-readable topics (but not all the techniques do that, and latent semantic analysis, which is what the demo app used and what we looked at, does not). There are some themes: they appear to be "about" something, with the terms having a decreasing weighting.