SlideShare a Scribd company logo
1 of 23
Download to read offline
Indexing, vector
spaces, search engines
Piero Molino - XY Lab
Information Retrieval
• Let a collection of unstructured information
• Let an information need
• Find an item of the collection that answers the
information need
• Collection = set of texts
• Information need = query
Representation
• Simple/naive assumption: a text can be
represented by the words that it contains
• Bag-of-word representation
• "Me and John and Mary attend this lab"
• Me:1, and: 2, John:1, Mary:1, attend:1, this:1 lab:1
Inverted index
• In order to access the documents, we build an
index, just like humans do
• Inverted Index: word-to-document
• "Indice Analitico"
Example
doc1:"I like football"
doc2:"John likes football"
doc3:"John likes basketball"
I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
football → doc1 doc2
basketball → doc3
Query
• Information need represented with the same bag-
of-words model
• Who likes basketball? → Who:1 likes:1 basketball:1
Retrieval
• Who:1 likes:1 basketball:1
I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
football → doc1 doc2
basketball → doc3
• Result: doc2, doc3 (without any order... or no?)
Boolean Model
• doc3 contains both "likes" and "basketball", 2/3 of
the input query terms, doc2 contains only 1/3
• Boolean Model: represents only presence or
absence of query words and ranks document
according to how many query words they contains
• Result: doc3, doc2 (with this prcise order)
likes → doc2 doc3
basketball → doc3
Vector Space Model
• Represent both documents and words with vectors
or points in a (multimensional) geometrical space
• We build a matrix term x documents
doc1 doc2 doc3
I 1 0 0
like 1 0 0
football 1 1 0
John 0 1 1
likes 0 1 1
football 1 1 0
basketball 0 0 1
Graphical Representation
likes
football
1
1
0
doc1
doc2
doc3
Tf-Idf
• In the cells of the matrix there's the value of the
frequency of a word in a document (term frequecy
tf)
• this value can be dumped down by the number of
documents that contain the word (inverse
document frequency idf)
• the final weight is called tf-idf
Tf-Idf Example
• the word "Romeo" is present 7000 times in the document
"Romeo and Juliet" and is present 7 times in the whole
collection of Shakespeare's works
• The tf-idf of "Romeo" in the document "Romeo and Juliet" is
7000x(1/7)=1000
• The word "and" is present 10000 times in the document
"Romeo and Juliet" and is present 100000 times in the
whole collection of Shakespeare
• Thw tf-idf of "and" in the document "Romeo and Juliet" is
10000x(1/100000)=0.1
Similarity
• Queries are like an additional column in matrix
• We can compute a similarity between document
and queries
Similarity
0
doc1
doc2
doc3
query
θ
Vector length normalization
Projection of one normalized
vector over another = cos(θ)
Cosine Similarity
• The cosine of the angle between two vectors is a
measure of how similar two vectors are
• As the vectors represents documents and queries, the
cosine is a measure of similarity of how similar is a
document with respect to the query
• Or how correlated they are: 1 = maximum correlation, 0
= no correlation, -1 = negative correlation
• The documents obtained from the inverted index are
ranked accordin to their cosine similarity wrt the query
Exploiting Hyperlinks
• In the web documents (webpages) are connected
to each other by hyperlinks
• Is there a way to exploit this topology?
• PageRank Algorithm (and several others)
Simulating Random Walks
4
2
6
1 3
7
5
Matrix Representation
1 2 3 4 5 6 7
1 1/2 1/2
2 1/2 1/2
3 1
4 1/4 1/4 1/4 1/4
5 1/2 1/2
6 1/2 1/2
7 1
Eigenvector
• e = random vector, can be interpreted as how
important is a node in the network
• es. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ....
• A = the matrix
• e = e · A, repetedly is like simulating walking randomly
on the graph
• the process converges to a vector e where each value
represents the importance in the network
Ranking with PageRank
I → doc1
like → doc1
football → doc1 doc2
John → doc2 doc3
likes → doc2 doc3
football → doc1 doc2
basketball → doc3
doc1 0.3
doc2 0.1
doc3 0.5
e
The inverted index gives doc2 and doc3
using the importance in the eigenvector for ranking we
output the ORDERED list:
doc3, doc2
Real World Ranking
• Real word search engines exploit Vector Space
Model like approaches, RageRank like approaches
and several others
• They balance all the different factors observing
what link you click on after issuing a query and
using them as examples for a machine learning
algorithm
Real World Ranking 2
• A machine learning algorithm works like a baby,
you give him examples of what is a dog and what is
a cat, he learns to look at the pointy ears, the nose
and oder aspects of the animals and when he sees
a new animal he says "cat" or "dog" depending on
his experience
• In this case the animal is the document/webpage,
the aspects are the PageRank value, the VSM
similarity to the query and so on, and the "cat" or
"dog" are "important" and "non-important"
Thank you
for your attention

More Related Content

What's hot

Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade offVARUN KUMAR
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5 Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 Salah Amean
 
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...Edureka!
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithmPradip Kumar
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Time series database, InfluxDB & PHP
Time series database, InfluxDB & PHPTime series database, InfluxDB & PHP
Time series database, InfluxDB & PHPCorley S.r.l.
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceFarzan Hajian
 
Data science and Artificial Intelligence
Data science and Artificial IntelligenceData science and Artificial Intelligence
Data science and Artificial IntelligenceSuman Srinivasan
 
Market basket analysis
Market basket analysisMarket basket analysis
Market basket analysisVermaAkash32
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptxSadhanaParameswaran
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization Sourabh Sahu
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using PythonShirin Mojarad, Ph.D.
 

What's hot (20)

Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5 Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
 
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...
Data Analyst vs Data Engineer vs Data Scientist | Data Analytics Masters Prog...
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithm
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Time series database, InfluxDB & PHP
Time series database, InfluxDB & PHPTime series database, InfluxDB & PHP
Time series database, InfluxDB & PHP
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduce
 
Data science and Artificial Intelligence
Data science and Artificial IntelligenceData science and Artificial Intelligence
Data science and Artificial Intelligence
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Market basket analysis
Market basket analysisMarket basket analysis
Market basket analysis
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using Python
 

Similar to Indexing, vector spaces, search engines

Indexing vector spaces graph search engines
Indexing vector spaces graph search enginesIndexing vector spaces graph search engines
Indexing vector spaces graph search enginesKenzo Kabuto
 
Vector space model12345678910111213.pptx
Vector space model12345678910111213.pptxVector space model12345678910111213.pptx
Vector space model12345678910111213.pptxsomeyamohsen2
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.pptIshaXogaha
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generationkrisztianbalog
 
introduction into IR
introduction into IRintroduction into IR
introduction into IRssusere3b1a2
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingKorea Sdec
 
Hub102 - Lesson4 - Data Structure
Hub102 - Lesson4 - Data StructureHub102 - Lesson4 - Data Structure
Hub102 - Lesson4 - Data StructureTiểu Hổ
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligencekrisztianbalog
 
Elastic Relevance Presentation feb4 2020
Elastic Relevance Presentation feb4 2020Elastic Relevance Presentation feb4 2020
Elastic Relevance Presentation feb4 2020Brian Nauheimer
 
CORE: co-author recommendation using network information and interest similarity
CORE: co-author recommendation using network information and interest similarityCORE: co-author recommendation using network information and interest similarity
CORE: co-author recommendation using network information and interest similarityRory Sie
 
Where is my data (in the cloud) tamir dresher
Where is my data (in the cloud)   tamir dresherWhere is my data (in the cloud)   tamir dresher
Where is my data (in the cloud) tamir dresherTamir Dresher
 

Similar to Indexing, vector spaces, search engines (20)

Indexing vector spaces graph search engines
Indexing vector spaces graph search enginesIndexing vector spaces graph search engines
Indexing vector spaces graph search engines
 
Vector space model12345678910111213.pptx
Vector space model12345678910111213.pptxVector space model12345678910111213.pptx
Vector space model12345678910111213.pptx
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Similarity: Retrieving Documents
Similarity: Retrieving DocumentsSimilarity: Retrieving Documents
Similarity: Retrieving Documents
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
introduction into IR
introduction into IRintroduction into IR
introduction into IR
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Search pitb
Search pitbSearch pitb
Search pitb
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modelling
 
Text features
Text featuresText features
Text features
 
Hub102 - Lesson4 - Data Structure
Hub102 - Lesson4 - Data StructureHub102 - Lesson4 - Data Structure
Hub102 - Lesson4 - Data Structure
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
 
Elastic Relevance Presentation feb4 2020
Elastic Relevance Presentation feb4 2020Elastic Relevance Presentation feb4 2020
Elastic Relevance Presentation feb4 2020
 
CORE: co-author recommendation using network information and interest similarity
CORE: co-author recommendation using network information and interest similarityCORE: co-author recommendation using network information and interest similarity
CORE: co-author recommendation using network information and interest similarity
 
Where is my data (in the cloud) tamir dresher
Where is my data (in the cloud)   tamir dresherWhere is my data (in the cloud)   tamir dresher
Where is my data (in the cloud) tamir dresher
 

Recently uploaded

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Recently uploaded (20)

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Indexing, vector spaces, search engines

  • 1. Indexing, vector spaces, search engines Piero Molino - XY Lab
  • 2. Information Retrieval • Let a collection of unstructured information • Let an information need • Find an item of the collection that answers the information need • Collection = set of texts • Information need = query
  • 3. Representation • Simple/naive assumption: a text can be represented by the words that it contains • Bag-of-word representation • "Me and John and Mary attend this lab" • Me:1, and: 2, John:1, Mary:1, attend:1, this:1 lab:1
  • 4. Inverted index • In order to access the documents, we build an index, just like humans do • Inverted Index: word-to-document • "Indice Analitico"
  • 5. Example doc1:"I like football" doc2:"John likes football" doc3:"John likes basketball" I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 football → doc1 doc2 basketball → doc3
  • 6. Query • Information need represented with the same bag- of-words model • Who likes basketball? → Who:1 likes:1 basketball:1
  • 7. Retrieval • Who:1 likes:1 basketball:1 I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 football → doc1 doc2 basketball → doc3 • Result: doc2, doc3 (without any order... or no?)
  • 8. Boolean Model • doc3 contains both "likes" and "basketball", 2/3 of the input query terms, doc2 contains only 1/3 • Boolean Model: represents only presence or absence of query words and ranks document according to how many query words they contains • Result: doc3, doc2 (with this prcise order) likes → doc2 doc3 basketball → doc3
  • 9. Vector Space Model • Represent both documents and words with vectors or points in a (multimensional) geometrical space • We build a matrix term x documents doc1 doc2 doc3 I 1 0 0 like 1 0 0 football 1 1 0 John 0 1 1 likes 0 1 1 football 1 1 0 basketball 0 0 1
  • 11. Tf-Idf • In the cells of the matrix there's the value of the frequency of a word in a document (term frequecy tf) • this value can be dumped down by the number of documents that contain the word (inverse document frequency idf) • the final weight is called tf-idf
  • 12. Tf-Idf Example • the word "Romeo" is present 7000 times in the document "Romeo and Juliet" and is present 7 times in the whole collection of Shakespeare's works • The tf-idf of "Romeo" in the document "Romeo and Juliet" is 7000x(1/7)=1000 • The word "and" is present 10000 times in the document "Romeo and Juliet" and is present 100000 times in the whole collection of Shakespeare • Thw tf-idf of "and" in the document "Romeo and Juliet" is 10000x(1/100000)=0.1
  • 13. Similarity • Queries are like an additional column in matrix • We can compute a similarity between document and queries
  • 14. Similarity 0 doc1 doc2 doc3 query θ Vector length normalization Projection of one normalized vector over another = cos(θ)
  • 15. Cosine Similarity • The cosine of the angle between two vectors is a measure of how similar two vectors are • As the vectors represents documents and queries, the cosine is a measure of similarity of how similar is a document with respect to the query • Or how correlated they are: 1 = maximum correlation, 0 = no correlation, -1 = negative correlation • The documents obtained from the inverted index are ranked accordin to their cosine similarity wrt the query
  • 16. Exploiting Hyperlinks • In the web documents (webpages) are connected to each other by hyperlinks • Is there a way to exploit this topology? • PageRank Algorithm (and several others)
  • 18. Matrix Representation 1 2 3 4 5 6 7 1 1/2 1/2 2 1/2 1/2 3 1 4 1/4 1/4 1/4 1/4 5 1/2 1/2 6 1/2 1/2 7 1
  • 19. Eigenvector • e = random vector, can be interpreted as how important is a node in the network • es. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, .... • A = the matrix • e = e · A, repetedly is like simulating walking randomly on the graph • the process converges to a vector e where each value represents the importance in the network
  • 20. Ranking with PageRank I → doc1 like → doc1 football → doc1 doc2 John → doc2 doc3 likes → doc2 doc3 football → doc1 doc2 basketball → doc3 doc1 0.3 doc2 0.1 doc3 0.5 e The inverted index gives doc2 and doc3 using the importance in the eigenvector for ranking we output the ORDERED list: doc3, doc2
  • 21. Real World Ranking • Real word search engines exploit Vector Space Model like approaches, RageRank like approaches and several others • They balance all the different factors observing what link you click on after issuing a query and using them as examples for a machine learning algorithm
  • 22. Real World Ranking 2 • A machine learning algorithm works like a baby, you give him examples of what is a dog and what is a cat, he learns to look at the pointy ears, the nose and oder aspects of the animals and when he sees a new animal he says "cat" or "dog" depending on his experience • In this case the animal is the document/webpage, the aspects are the PageRank value, the VSM similarity to the query and so on, and the "cat" or "dog" are "important" and "non-important"
  • 23. Thank you for your attention