[F5 Hit Refresh] Pierpaolo Basile - Information Access with Apache Lucene
1. Information Access with
Apache Lucene
Pierpaolo Basile, PhD
Research Assistant, University of Bari Aldo Moro
co-founder QuestionCube s.r.l.
pierpaolo.basile@{uniba.it, questioncube.com}
CODE CAMP: JULY 20 - AUGUST 1, 2012
S. VITO DEI NORMANNI (BR)
2. Outline
• Search Engine
– Vector Space Model
– Inverted Index
• Apache Lucene
– Indexing
– Searching
• Advanced topics
– Relevance feedback
– Document similarity
– Apache Tika
4. Searching
Collection of
documents
User query
Relevant
documents
5. Information Retrieval
Information Retrieval (IR) is the area of study
concerned with searching for documents,
for information within documents, and
for metadata about documents, as well as that of
searching structured storage, relational databases,
and the World Wide Web.
Wikipedia
6. Information Retrieval Process
[Diagram: User Interface -> Text Operations -> logical view of query and documents -> Query Operations / Indexing -> Searching over the Inverted Index -> retrieved documents -> Ranking -> ranked documents returned to the user, with user feedback flowing back into the query.]
7. Information Retrieval Model
<D, Q, F, R(qi, dj)>
• D: document representation
• Q: query representation
• F: query/document representation framework
• R(qi, dj): ranking function
8. Bag-of-words representation
Document/query as unordered collection of words
John likes to watch movies. Mary likes too. John also likes to
watch football games.
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6,
"football": 7, "games": 8, "Mary": 9, "too": 10}
9. Vector Space Model
Query/Document Framework
• Algebraic model
• Documents/queries are represented as vector
in a geometric space
• Terms are vector components
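To make "terms are vector components" concrete, here is a minimal Java sketch that turns a text into a term-count vector under the bag-of-words assumption. The class name `BagOfWords` and the naive split on non-letter characters are illustrative assumptions, not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BagOfWords {
    // Counts word occurrences; word order is discarded, only frequencies remain.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        // Naive tokenization: split on any run of non-letter characters.
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }
}
```

Each distinct word becomes one dimension of the vector, and the count is a simple (unweighted) component value.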
10. Vector Space Model
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
Each vector component corresponds to a term.
The component value $w_{i,j}$ is the weight of the term $t_i$
in the document $d_j$.
14. Term-Document matrix
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number
One,’ said Alice.
• Alice watched the White King as he slowly struggled up from bar to bar, till at
last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate.
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a
good deal!’ was her first remark.
• In the pool of light was a billiards table, with two figures moving around it. Alice
walked toward them, and as she approached they turned to look at her.
• Alice lay back, and closed her eyes. There was the Red Queen again, with that
incessant grin. Or was it the Cheshire cat's grin?
18. Inverted Index
Dictionary
Alice D1:2 D2:2 D3:2
…
book D1:1
…
Cheshire Cat D1:1 D3:1
…
grin D3:2
…
King D1:1 D2:1
…
Queen D2:1 D3:1
…
table D2:1 D3:1
…
19. Inverted Index
Dictionary Posting List
Alice D1:2 D2:2 D3:2
…
book D1:1
…
Cheshire Cat D1:1 D3:1
…
grin D3:2
…
King D1:1 D2:1
…
Queen D2:1 D3:1
…
table D2:1 D3:1
…
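The dictionary and posting lists above can be sketched in plain Java as a map from each term to its per-document frequencies. This is a simplified in-memory illustration, not how Lucene stores its index; the class `InvertedIndex` and its methods are hypothetical names, and tokenization is a naive split on non-letter characters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndex {
    // term -> (document id -> term frequency); TreeMap keeps both levels sorted.
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    // Tokenizes the text and records one occurrence per token for the document.
    void add(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the posting list of a term, or an empty map if the term is unknown.
    Map<String, Integer> postingList(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeMap<>());
    }
}
```

Looking up a term is a single dictionary access, which is why query processing works on posting lists rather than scanning documents.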
27. Inverted Index
Dictionary      Posting List        Position List
Alice           D1:2 D2:2 D3:2      14, 49 | 1, 40 | 18, 34
…
book            D1:1                33
…
Cheshire Cat    D1:1 D3:1
…
grin            D3:2
…
King            D1:1 D2:1
…
Queen           D2:1 D3:1
…
table           D2:1 D3:1
…
(The first group in each position list holds the term positions in D1.)
28. Inverted Index: query processing
AND (intersection)
<Alice>  D1:2 D2:2 D3:2
<King>   D1:1 D2:1
Alice AND King -> (D1, D2)
29. Inverted Index: query processing
OR (union)
<grin>   D3:2
<King>   D1:1 D2:1
grin OR King -> (D1, D2, D3)
30. Inverted Index: query processing
NOT (complement)
<Alice>  D1:2 D2:2 D3:2
<grin>   D3:2
Alice NOT grin -> (D1, D2)
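The three Boolean operations on posting lists map directly onto set operations. The sketch below represents a posting list simply as a sorted set of document ids (frequencies omitted); `BooleanOps` is an illustrative name, and a real engine would merge sorted lists rather than copy sets.

```java
import java.util.Set;
import java.util.TreeSet;

final class BooleanOps {
    // AND: documents present in both posting lists (intersection).
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in at least one posting list (union).
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: documents in the first posting list that are absent from the second.
    static Set<String> not(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

With Alice = {D1, D2, D3}, King = {D1, D2} and grin = {D3}, these methods reproduce the results of the AND, OR and NOT examples.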
31. Term-weight
• Measures the term relevance in a document
– component value in the document representation
– TF*IDF
• TF (term frequency): term occurrences in the document
• IDF (inverse document frequency): inverse to the
number of documents in which the term occurs
$\mathrm{tf} \ast \mathrm{idf}(t, d) = \mathrm{tf}(t, d) \ast \log \frac{|D|}{|\{d \in D : t \in d\}|}$
32. Term-weight
• Measures the term relevance in a document
– component value in the document representation
– TF*IDF
• TF (term frequency): term occurrences in the document
• IDF (inverse document frequency): inverse to the
number of documents in which the term occurs
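$\mathrm{tf} \ast \mathrm{idf}(t, d) = \mathrm{tf}(t, d) \ast \log \frac{|D|}{|\{d \in D : t \in d\}|}$
where $|D|$ is the number of documents in the collection and the logarithmic factor is the idf.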
33. TF*IDF insights
• increases proportionally to the term
frequency in the document
• decreases with the number of documents in
which the term occurs
– common words are generally more frequent in the
collection
• IDF depends on the collection, TF on the
document
34. Inverted index/TF*IDF
• TF: computed by term occurrences in the
posting list
• IDF: computed by the posting list cardinality
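Alice D1:2 D2:2 D3:2 |D|=100
$\mathrm{tf} \ast \mathrm{idf}(\text{Alice}, D1) = 2 \ast \log \frac{100}{3}$
TF: the factor 2 is the number of occurrences of Alice in D1 (posting D1:2).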
35. Inverted index/TF*IDF
• TF: computed by term occurrences in the
posting list
• IDF: computed by the posting list cardinality
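Alice D1:2 D2:2 D3:2 |D|=100
$\mathrm{tf} \ast \mathrm{idf}(\text{Alice}, D1) = 2 \ast \log \frac{100}{3}$
IDF: the factor $\log \frac{100}{3}$ comes from the 3 documents in the posting list of Alice out of |D| = 100.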
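The worked example can be checked with a few lines of Java. This is a sketch of the plain TF*IDF weight from the formula above, not Lucene's scoring; the class name `TfIdf` is illustrative, and the natural logarithm is an assumption because the slides do not state the base.

```java
final class TfIdf {
    // tf * log(|D| / df), where df is the number of documents containing the term.
    // The natural logarithm is assumed; another base only rescales all weights.
    static double weight(int termFreq, int numDocs, int docFreq) {
        if (numDocs <= 0 || docFreq <= 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("need 0 < docFreq <= numDocs");
        }
        return termFreq * Math.log((double) numDocs / docFreq);
    }
}
```

For Alice in D1, `TfIdf.weight(2, 100, 3)` evaluates 2 * log(100/3); a term that occurs in every document gets weight 0 regardless of its frequency.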
37. Apache Lucene
• Apache Software Foundation project
– http://lucene.apache.org
• What Lucene is
– Search library: indexing and searching Application
Programming Interface (API)
• What Lucene is not
– Search engine (no crawling, server, user interface,
etc.)
38. Lucene
• Full-text search
• High performance
• Scalable
• Cross-platform
• 100%-pure Java
• Ports to other languages: .NET, Python
39. Information Retrieval Process
[Diagram: the Information Retrieval process of slide 6, repeated: User Interface, Text Operations, Query Operations, Indexing, Searching over the Inverted Index, Ranking, user feedback.]
40. Information Retrieval Process
[Diagram: the same process mapped onto org.apache.lucene classes]
• Document text -> Document
• Text operations -> Analyzer/Tokenizer/Filter
• Query operations -> QueryParser
• Indexing -> IndexWriter
• Searching the inverted index -> IndexSearcher, IndexReader
• Ranking -> Similarity
41. Lucene Model 1/2
• Based on Vector Space Model
• Features:
– multi-field document
– tf–term frequency: number of matching terms in field
– lengthNorm: number of tokens in field
– idf: inverse document frequency
– coord: coordination factor, number of matching terms
– field boost
– query clause boost
42. Lucene Model 2/2
• coord(q,d): score factor based on how many of the query terms are
found in the specified document
• queryNorm(q): a normalizing factor used to make scores
between queries comparable
• boost(t): term boost
• norm(t,d): encapsulates a few (indexing-time) boost and length
factors
44. Document Indexing
// Open the directory that will hold the index files
FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
// CREATE overwrites any existing index in the directory
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, config);
// A document is a set of named fields
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
// Closing the writer commits the changes
writer.close();
45. Field Options
• Field.Store
– NO: field value is not stored
– YES: field value is stored
• Field.Index
– ANALYZED: field is analyzed by the analyzer
– NO: not indexed
– NOT_ANALYZED: indexed but not analyzed
49. Text Operations
• Tokenization
– split text into tokens
• Stop word elimination
– remove common words and closed-class words
(e.g. function words)
• Stemming
– reducing inflected (or sometimes derived) words
to their stem
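The first two text operations can be chained in a few lines of plain Java. This sketch does not use Lucene: the class `SimpleTextPipeline` and its tiny stop list are illustrative assumptions, and stemming is left out because it needs a dedicated algorithm such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SimpleTextPipeline {
    // Tiny illustrative stop list; an assumption, not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenization: split on any run of non-letter characters.
        for (String raw : text.split("[^\\p{L}]+")) {
            // Normalization: lowercase so "The" and "the" become the same term.
            String token = raw.toLowerCase(Locale.ROOT);
            // Stop word elimination: skip empty strings and common words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The same steps must be applied to both documents and queries, otherwise query terms will not match the indexed terms.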
50. Analyzers
• KeywordAnalyzer
– Returns a single token
• WhitespaceAnalyzer
– Splits tokens at whitespace
• SimpleAnalyzer
– Divides text at nonletter characters and lowercases
• StopAnalyzer
– Divides text at nonletter characters, lowercases, and removes
stop words
• StandardAnalyzer
– Tokenizes based on a sophisticated grammar that recognizes
e-mail addresses, acronyms, etc.; lowercases and removes stop
words (optional)
51. Analyzers
The quick brown fox jumped over the lazy dogs
• WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
• StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
53. Tokenizers
• A Tokenizer is a TokenStream whose input is a Reader
• Tokenizers break field text into tokens
• StandardTokenizer
– source string: “full-text lucene.apache.org”
– “full” “text” “lucene.apache.org”
• WhitespaceTokenizer
– “full-text” “lucene.apache.org”
• LetterTokenizer
– “full” “text” “lucene” “apache” “org”
54. TokenFilters
• A TokenFilter is a TokenStream whose input is
another token stream
• LowerCaseFilter
• StopFilter
• LengthFilter
• PorterStemFilter
– stemming: reducing words to root form
– rides, ride, riding => ride
– country, countries => countri
55. Analyzer
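import java.io.Reader;
import java.util.ArrayList;
import java.util.Set;
import org.apache.lucene.analysis.*;
import org.apache.lucene.util.Version;

public class MyAnalyzer2 extends Analyzer {
// Stop word set built from the default English list
private final Set<?> sw = StopFilter.makeStopSet(Version.LUCENE_36, new
ArrayList<Object>(StopAnalyzer.ENGLISH_STOP_WORDS_SET));
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
// Chain: whitespace tokenizer -> lowercase filter -> stop word filter
TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
ts = new LowerCaseFilter(Version.LUCENE_36, ts);
ts = new StopFilter(Version.LUCENE_36, ts, sw);
return ts;
}
}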
56. TermVectors
• Store information about term frequency,
positions and offsets
• Relevance Feedback and “More Like This”
• Clustering
• Similarity between two documents
• Highlighter
– needs offsets info
57. Lucene TermVector
• TermFreqVector is a representation of all of the
terms and term counts in a specific Field of a
Document instance
• As a tuple:
– termFreq = <term, term countD>
– <fieldName, <…,termFreqi, termFreqi+1,…>>
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies()
58. Lucene TermPositionVector
• TermPositionVector is a representation of all
of the terms, term counts, term position and term
offsets in a specific Field of a Document instance
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies()
– public int[] getTermPositions(int index)
– public TermVectorOffsetInfo[]
getOffsets(int index)
59. Store TermVector
• During indexing, create a Field that stores Term Vectors:
– new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
• Options are:
– Field.TermVector.YES
– Field.TermVector.NO
– Field.TermVector.WITH_POSITIONS
– Field.TermVector.WITH_OFFSETS
– Field.TermVector.WITH_POSITIONS_OFFSETS
60. Term Vector: Positions and Offsets
Token    Position  Offsets (start-end)
The      0         0-3
quick    1         4-9
brown    2         10-15
fox      3         16-19
jumped   4         20-26
over     5         27-31
the      6         32-35
lazy     7         36-40
dogs     8         41-45
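Positions count tokens, while offsets count characters. The following plain-Java sketch computes both for whitespace-separated tokens; `TokenSpans` and `Span` are hypothetical names used only for illustration, and Lucene computes these values inside its tokenizers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenSpans {
    // One token with its ordinal position and its character offsets [start, end).
    static final class Span {
        final String term;
        final int position;
        final int start;
        final int end;

        Span(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Scans the text once, numbering tokens from 0 and recording character offsets.
    static List<Span> spans(String text) {
        List<Span> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int position = 0;
        while (m.find()) {
            out.add(new Span(m.group(), position++, m.start(), m.end()));
        }
        return out;
    }
}
```

Offsets are what a highlighter needs to mark the matching characters in the original text, whereas positions support phrase and proximity matching.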
61. Example
• Main method with three parameters
– Input_dir: directory to index
– Index_dir: directory containing the index files
– Flag (string) corresponding to different
Field.TermVector... options
• t: tokenize
• tv: term vectors
• tvp: term vectors with positions
• tvo: term vectors with positions and offsets
• The constructor of the Index class takes the three
parameters as input and indexes the directory
– Filters only “.txt” files
62. Lucene Search
• IndexReader: interface for accessing an index
• QueryParser: parses the user query
• IndexSearcher: implements search over a
single IndexReader
– search(Query query, int numDoc)
– TopDocs -> result of search
• TopDocs.scoreDocs returns an array of retrieved
documents
63. Lucene Search
// args[0]: index directory, args[1]: query string
Directory dir = FSDirectory.open(new File(args[0]));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
// "text" is the default field for terms without an explicit field
QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query query = parser.parse(args[1]);
// Retrieve at most the 100 best-scoring documents
TopDocs topDocs = searcher.search(query, 100);
System.out.println("matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
Document doc = reader.document(topDocs.scoreDocs[i].doc);
System.out.println("name: " + doc.get("name") + ", score: " +
topDocs.scoreDocs[i].score);
System.out.println();
}
searcher.close();
// The searcher does not close a reader that was passed to its constructor
reader.close();
64. Lucene QueryParser
• Example:
– queryParser.parse("name:SpiderMan");
• good for human-entered queries and debugging
• does text analysis and constructs appropriate
queries
• not all query types supported
66. QUERY SYNTAX 2/2
• Boosting a Term
– jakarta^4 apache
– "jakarta apache"^4 "Apache Lucene"
• Boolean Operator
– NOT, OR, AND
– + required operator
– - prohibit operator
• Grouping by ( )
• Field Grouping
• title:(+return +"pink panther")
• Escaping Special Characters by
67. QUERY (BY CODE) 1/2
TermQuery query =
new TermQuery(new Term("name", "Spider-Man"));
• explicit, no escaping necessary
• does not do text analysis for you
• Query
– TermQuery
– BooleanQuery
– PhraseQuery
– FuzzyQuery / WildcardQuery /
68. QUERY (BY CODE) 2/2
TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
BooleanQuery bq = new BooleanQuery();
bq.add(tq1, BooleanClause.Occur.MUST);
bq.add(tq2, BooleanClause.Occur.SHOULD);
69. DELETING DOCUMENTS
IndexWriter
– deleteDocuments(Term t)
– deleteDocuments(Term… t)
– deleteDocuments(Query t)
– deleteDocuments(Query… t)
– updateDocument(Term t, Document d)
– updateDocument(Term t, Document d,
Analyzer a)
– deleting does not immediately reclaim space
71. Relevance feedback
• Improve retrieval performance using
information about document relevance
– Explicit: relevance indicated by the user (assessor)
– Implicit: from user behavior
– Blind (or pseudo): using information about the top
k retrieved documents
74. Rocchio Algorithm (blind)
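$Q_{new} = \alpha Q + \beta \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$
In blind (pseudo) relevance feedback only the relevant documents
are known (they are assumed to be the top K retrieved
documents), so the $D_{norel}$ term is not used
(the variant generally adopted by search engines)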
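A minimal sketch of the blind variant in plain Java, assuming query and documents are sparse term-weight maps and that the top K retrieved documents are treated as relevant. The class `Rocchio` and the parameter names `alpha` and `beta` are illustrative choices; no non-relevant term is applied.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Rocchio {
    // Returns alpha * Q + beta * centroid(topDocs) as a sparse term-weight vector.
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> topDocs,
                                      double alpha, double beta) {
        Map<String, Double> expanded = new HashMap<>();
        // Original query terms, scaled by alpha.
        query.forEach((term, w) -> expanded.merge(term, alpha * w, Double::sum));
        if (!topDocs.isEmpty()) {
            // Each pseudo-relevant document contributes beta / K of its weights.
            double scale = beta / topDocs.size();
            for (Map<String, Double> doc : topDocs) {
                doc.forEach((term, w) -> expanded.merge(term, scale * w, Double::sum));
            }
        }
        return expanded;
    }
}
```

In practice only the highest-weighted new terms would be kept, so that the expanded query stays short.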
75. Relevance feedback
Q = Alice
• It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
• It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
• …
• Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
Qnew = Alice Cheshire Cat King book
• It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
• It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
• Alice book was reprinted and published in 1866.
76. Relevance feedback in Lucene
• Possible using TermVector
– Retrieve the top K documents
– Build the Qnew using TermVector
– Re-query using Qnew
77. Document similarity
• Documents are represented by vectors
– Build the document vector using TermVector
– Compute the similarity using cosine similarity
• Implement a functionality like “similar to this
document”
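Cosine similarity between two document vectors can be sketched in plain Java as follows. The vectors are sparse term-weight maps, as they might be built from term vectors; the class name `Cosine` is an illustrative assumption and the code does not call Lucene.

```java
import java.util.Map;

final class Cosine {
    // cos(a, b) = (a . b) / (|a| * |b|); returns 0 when either vector is empty.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double wa = e.getValue();
            normA += wa * wa;
            // Only terms shared by both documents contribute to the dot product.
            Double wb = b.get(e.getKey());
            if (wb != null) {
                dot += wa * wb;
            }
        }
        for (double wb : b.values()) {
            normB += wb * wb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A "similar to this document" feature would compute this score between the selected document and the candidates, then sort by descending similarity.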
78. Apache Tika
• Extract metadata and text from several file
formats
– XML, HTML, OpenOffice, Microsoft Office, PDF
– MBOX format (email)
– Compressed files
– EXIF metadata from image
– Metadata from audio file (MP3)
• An Apache project, like Lucene
79. Extract metadata and text
Tika tika = new Tika();
String type = tika.detect(new File(args[0]));
System.out.println("File type: " + type);
Metadata metadata = new Metadata();
String text = tika.parseToString(new FileInputStream(new File(args[0])),
metadata);
System.out.println("Metadata");
String[] names = metadata.names();
for (int i = 0; i < names.length; i++) {
String value = metadata.get(names[i]);
if (value != null) {
System.out.println(names[i] + "=" + value);
}
}
System.out.println("Text: " + text);
80. Apache Tika + Lucene
[Diagram: Files/URLs -> Tika -> metadata and text -> Lucene]
82. Conclusions
• A popular IR model
– Vector Space Model
• Lucene
– provides an API to build search engines
• Apache Tika
– extracts metadata and text from files and URLs
• LET’S GO BUILD YOUR SEARCH ENGINE!
83. Lucene Contrib
• analyzers: additional analyzers
– Arabic, Chinese, …
• highlighter: a set of classes for highlighting
matching terms in search results
• Snowball: stemmer analyzer based on
Snowball library http://snowball.tartarus.org
• spellchecker: tools for spellchecking and
suggestions with Lucene
84. Lucene-related projects
• Apache Solr: open source enterprise search
platform from the Apache Lucene project
– Full-Text Search Capabilities, High Volume Web Traffic,
HTML Administration Interfaces
• Apache Nutch: web-search engine based on Solr
and Lucene
– crawler, link-graph database, HTML