Latent semantic analysis
Getting started

Latent semantic analysis (LSA) is a technique in natural
language processing, in particular in vectorial semantics,
of analyzing relationships between a set of documents and
the terms they contain by producing a set of concepts
related to the documents and terms. (Wikipedia)
Natural language processing (NLP) is a field of computer
science, artificial intelligence, and linguistics concerned with the
interactions between computers and human (natural) languages. (Wikipedia)

Natural language processing can be divided into four phases:
Grammar analysis
Lexical analysis
Semantic analysis
Syntactic analysis
Apache OpenNLP
A machine learning based toolkit for the processing of natural
language text.
http://opennlp.apache.org/

LSA can be seen as a part of NLP.
Apache OpenNLP usage examples:
Lexical analysis: Tokenization
Grammar analysis: Part-of-speech tagging
Syntactic analysis: Chunker - Parser

NOTE: before the lexical analysis it is possible to use a
sentence analysis tool: the sentence detector (Apache OpenNLP).
Supervised machine learning concepts

Training. Humans produce a finite set of (INPUT, OUTPUT)
couples: for example, INPUT = a Wikipedia corpus and
OUTPUT = the same corpus POS-tagged. This set is the
training set, and it can be seen as a discrete function.
A machine learning algorithm (e.g. linear regression,
maximum entropy, perceptron) takes the training set and
produces a MODEL, which can be seen as a continuous function.

Prediction. Input data are now taken from an infinite set
(e.g. just a document). The machine, using the model and
the input, produces the expected output (that document
POS-tagged).
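The training/prediction split above can be sketched with a deliberately tiny example. This is plain Python, not OpenNLP; the "most frequent tag" model is a toy stand-in for real algorithms such as maximum entropy, and the words and tags are made up:

```python
from collections import Counter, defaultdict

# Training set: a finite set of (INPUT, OUTPUT) couples produced by humans,
# here (word, part-of-speech tag) pairs instead of a full tagged corpus.
training_set = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
    ("dog", "NN"),
]

def train(pairs):
    """The 'machine learning algorithm': build a model from the finite
    training set. The model maps each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def predict(model, words):
    """Apply the model to unseen input (drawn from an infinite set);
    unknown words fall back to a default tag, 'NN' here."""
    return [(w, model.get(w, "NN")) for w in words]

model = train(training_set)
print(predict(model, ["the", "dog", "sleeps"]))
```

The point is the shape of the process, not the model: a finite set of human-labelled pairs goes in, a function comes out, and that function is then applied to inputs it has never seen.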
LSA assumes that words that are close in
meaning will occur in similar pieces of text.
LSA is a method for discovering hidden
concepts in document data.
LSA key concepts

We start from a set of documents (Doc 1, Doc 2, Doc 3, Doc 4);
each document contains several words. The LSA algorithm takes
the documents and the words and evaluates vectors in a semantic
vectorial space using:
• a documents/words matrix
• singular value decomposition (SVD)

In the semantic vectorial space, if word1 and word2 are close,
it means that their (latent) meanings are related.
Example:

Words/documents matrix for Doc 1 - Doc 4:

        Doc1  Doc2  Doc3  Doc4
Word1     1     0     1     0
Word2     1     0     1     1
Word3     0     1     0     1
…

1: there are occurrences of the i-th word in the j-th document.
0: there are no occurrences of the i-th word in the j-th document.

The matrix dimension is very big (thousands of words, hundreds
of documents).
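Building such a matrix is mechanical. A minimal sketch in plain Python, with four made-up toy documents chosen so that the resulting rows match the Word1-Word3 rows above:

```python
# Toy corpus: four short documents (hypothetical contents).
docs = [
    "cat dog cat",      # Doc1
    "fish",             # Doc2
    "dog cat",          # Doc3
    "dog fish",         # Doc4
]

# Vocabulary in first-seen order.
words = []
for doc in docs:
    for w in doc.split():
        if w not in words:
            words.append(w)

# matrix[i][j] = 1 if the i-th word occurs in the j-th document, else 0.
matrix = [[1 if w in doc.split() else 0 for doc in docs] for w in words]

for w, row in zip(words, matrix):
    print(f"{w:5s} {row}")
```

Real systems typically use occurrence counts or weighted frequencies rather than plain 0/1, but the structure of the matrix is the same.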
Matrix SVD decomposition is used to reduce the matrix dimension.
The Semantic Vectors or JLSI libraries:
• perform the SVD decomposition;
• build the vectorial semantic space (e.g. placing word1 and
word2 near doc1, doc2 and doc4).
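In practice the SVD is delegated to a library such as the two above. Purely as an illustration of what it computes, power iteration (plain Python, stdlib only; not how Semantic Vectors or JLSI are implemented) finds the dominant singular direction of the small matrix above, i.e. the first axis of the reduced semantic space:

```python
import math

A = [[1, 0, 1, 0],   # Word1
     [1, 0, 1, 1],   # Word2
     [0, 1, 0, 1]]   # Word3

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Power iteration on A^T A converges to the top right-singular vector v;
# its singular value measures how much document variation this single
# "concept" direction captures.
v = normalize([1.0] * 4)
for _ in range(100):
    v = normalize(matvec(transpose(A), matvec(A, v)))

sigma1 = math.sqrt(sum(x * x for x in matvec(A, v)))  # top singular value
print([round(x, 3) for x in v], round(sigma1, 3))
```

A full SVD repeats this for the remaining singular directions; keeping only the first few of them is exactly the dimension reduction LSA relies on.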
UIMA can be used to manage the overall solution.
Online references:
http://opennlp.apache.org/documentation/manual/opennlp.html
https://code.google.com/p/semanticvectors/
http://hlt.fbk.eu/en/technology/jlsi
http://uima.apache.org/
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Coursera video references:
http://www.coursera.org/course/nlangp
http://www.coursera.org/course/ml
Some snippets and console commands

OpenNLP has a command line tool which is used to train the
models, producing a trained model.

This snippet takes four files as input (the models and the
document to manage) and evaluates a new file that is sentence
detected, tokenized and POS-tagged.
The output (sentences, tokens, tags) is a document that is
sentence detected, tokenized and POS-tagged, and that could
be, for example, indexed in a search engine like Apache Solr.
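The snippet itself is a screenshot in the original deck. The shape of the pipeline it describes, sentence detection, then tokenization, then POS tagging, can be mimicked with a stdlib-only sketch (plain Python with naive hand-written rules and a made-up mini-lexicon; not the OpenNLP API or its trained models):

```python
import re

def detect_sentences(text):
    """Naive sentence detector: split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: words and punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    """Placeholder tagger: a real system would apply a trained model here."""
    lexicon = {"the": "DT", "a": "DT", "runs": "VBZ", "dog": "NN", ".": "."}
    return [(t, lexicon.get(t.lower(), "NN")) for t in tokens]

doc = "The dog runs. A cat sleeps."
for sentence in detect_sentences(doc):
    print(pos_tag(tokenize(sentence)))
```

In the real pipeline each of the three stages loads a trained OpenNLP model file instead of these rules, but the data flow (document → sentences → tokens → tagged tokens) is the same.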
Note that lucene-core is a transitive dependency; a .bat file
can be used to load the classpath.

SemanticVectors has two main functions:
1. Building wordSpace models. To build the wordSpace model,
Semantic Vectors needs indexes created by Apache Lucene.
2. Searching through the vectors in such models.
E.g.: a Bible chapter indexed by Lucene.
1. Building wordSpace models using the pitt.search.semanticvectors.LSA
class from the index created by Apache Lucene (from a Bible chapter).

In this example the Bible chapter contains 29 documents, and in total
there are 2460 terms. Semantic Vectors builds:
1. 29 vectors that represent the documents (docvector.bin)
2. 2460 vectors that represent the terms (termvector.bin)
These two files represent the wordSpace.

Note that it is also possible to use the pitt.search.semanticvectors.BuildIndex
class, which uses Random Projection instead of LSA to reduce the
dimensional representation.
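The build command itself is a screenshot in the deck. Based on the class names above, the invocation has roughly this shape; this is a command sketch, not runnable as-is: the jar names and index path are placeholders, and the flag spelling should be checked against the SemanticVectors documentation for your version:

```
# Classpath must contain semanticvectors and its lucene-core dependency
# (placeholder jar names; adjust to your versions).
CP="semanticvectors.jar:lucene-core.jar"

# Build the wordSpace (docvector.bin and termvector.bin) with LSA
# from an existing Lucene index of the Bible chapter:
java -cp "$CP" pitt.search.semanticvectors.LSA -luceneindexpath bible_index

# Alternative: Random Projection instead of LSA:
java -cp "$CP" pitt.search.semanticvectors.BuildIndex -luceneindexpath bible_index
```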
2. Searching through docVector and termVector

2.1 Searching for documents using terms
Search for the document vectors closest to the vector "Abraham":

2.2 Using a document file as a source of queries
Find the terms most closely related to Chapter 1 of Chronicles:

2.3 Searching for a general word
Find the terms most closely related to "Abraham".

2.4 Comparing words
Compare "abraham" with "Isaac".
Compare "abraham" with "massimo".
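Whatever tool runs them, the searches in 2.1-2.4 all reduce to ranking stored vectors by cosine similarity to a query vector. A stdlib-only sketch over made-up two-dimensional vectors (the term and document names and all coordinates are invented for illustration; real vectors come from termvector.bin and docvector.bin and have hundreds of dimensions):

```python
import math

# Toy 2-d wordSpace (made-up vectors and names).
term_vectors = {
    "abraham": (0.9, 0.1),
    "isaac":   (0.8, 0.2),
    "massimo": (0.1, 0.9),
}
doc_vectors = {
    "chronicles_1": (0.7, 0.3),
    "genesis_1":    (0.95, 0.05),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query, vectors):
    """Rank vectors by cosine similarity to the query (as in 2.1-2.3)."""
    return sorted(vectors, key=lambda k: cosine(query, vectors[k]), reverse=True)

# 2.1: documents closest to the term "abraham":
print(search(term_vectors["abraham"], doc_vectors))
# 2.4: comparing words is a single cosine between two term vectors:
print(round(cosine(term_vectors["abraham"], term_vectors["isaac"]), 3))
print(round(cosine(term_vectors["abraham"], term_vectors["massimo"]), 3))
```

Note how the expected pattern falls out of the geometry: "abraham" scores high against "isaac" and low against an unrelated word like "massimo".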
