4. Information Retrieval
Document
Term
Inverted Index
Term Frequency (tf)
Skip Pointers
Positional Index
Collection Frequency
Document Frequency (df)
Inverse Document Frequency: idf = log10(N/df), where N is the number of documents in the collection
Term Frequency Inverse Document Frequency
tf-idf = tf * idf
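The terms above can be tied together in a small sketch. It is illustrative only: the sample documents, the whitespace tokenizer, and the function names (`build_inverted_index`, `collection_frequency`, `tf_idf`) are assumptions made for this example, not part of Lucene or SOLR.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def collection_frequency(index, term):
    """Total number of occurrences of the term across the whole collection."""
    return sum(index.get(term, {}).values())


def tf_idf(index, n_docs, term, doc_id):
    """tf-idf = tf * log10(N / df), or 0.0 if the term is not in the document."""
    postings = index.get(term, {})
    df = len(postings)  # document frequency: number of docs containing the term
    if df == 0 or doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log10(n_docs / df)


# Hypothetical three-document collection.
docs = {
    1: "solr is built on lucene",
    2: "lucene is a search library",
    3: "solr adds replication",
}
index = build_inverted_index(docs)
print(tf_idf(index, len(docs), "replication", 3))
```

A term that occurs in only one document ("replication") gets a higher weight than one that occurs in two ("lucene"), which is the point of the idf factor.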
www.teknopoint.us
6. Apache SOLR
Fire Powered Lucene
Distributed
Replicated
Remote
And just for the record, it's…
SEARCH On LUCENE w/REPLICATION (TBHPHB)
11. Indexing
Indexing uses HTTP POST, so documents to be indexed can be
posted to SOLR via a web request
Data can be pulled using the Data Import Handler (reads
over HTTP GET or from a database)
SOLR can index text and metadata extracted from documents,
video, MP3, images and other binary content
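As a minimal sketch of indexing over HTTP POST, the snippet below builds a JSON update request using only the Python standard library. The host, port, core name (`mycore`) and the document fields (`id`, `title`) are assumptions for illustration; adjust them to the actual SOLR installation and schema. The request is only constructed here, and sending it is left commented out because it needs a running SOLR instance.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr/mycore"):
    """Build an HTTP POST that sends a JSON array of documents to /update.

    base_url is a hypothetical local core; commit=true asks SOLR to make
    the documents searchable as soon as the request is processed.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document; field names depend on the schema in use.
request = build_update_request([{"id": "page-1", "title": "Hello Solr"}])
print(request.get_method(), request.full_url)

# To actually send it (requires a running SOLR instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```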
12. Search
Search features:
Paging, Filtering, Sorting, Faceting
Response formats: XML (default), JSON, PHP, Ruby, Python, etc.
Query parser: interprets the query string. Two types of query
parsers:
Lucene Query Syntax Parser
DisMax Parser (Disjunction Max)
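The search features listed above map onto request parameters of the select handler. The following sketch builds such a URL; the base URL, the field names (`title`, `body`, `type`) and the helper `build_select_url` are assumptions for illustration, while the parameter names (`q`, `defType`, `qf`, `fq`, `sort`, `start`, `rows`, `facet`, `facet.field`, `wt`) are standard SOLR query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, page=0, page_size=10):
    """Build a /select URL that uses paging, filtering, sorting and faceting."""
    params = [
        ("q", text),                 # user query
        ("defType", "dismax"),       # use the DisMax query parser
        ("qf", "title body"),        # fields DisMax searches (hypothetical names)
        ("fq", "type:page"),         # filtering: restrict results without affecting score
        ("sort", "score desc"),      # sorting
        ("start", page * page_size), # paging: offset of the first result
        ("rows", page_size),         # paging: number of results per page
        ("facet", "true"),           # faceting on
        ("facet.field", "type"),     # field to count facet values for
        ("wt", "json"),              # response format
    ]
    return base_url + "/select?" + urlencode(params)


# Third page (page index 2) of results from a hypothetical local core.
url = build_select_url("http://localhost:8983/solr/mycore", "aem solr", page=2)
print(url)
```

Dropping the `defType` and `qf` parameters would leave the query to the default Lucene query syntax parser instead.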
13. Solr integration approaches
Crawl using an external crawler like Nutch or Heritrix
CQ servlets to serialize content into a format Solr accepts (JSON/XML)
JCR Observer for page modifications to trigger indexing
to Solr.
14. AEM 6
2 types:
Built-in (embedded)
Remote (for distributed setups)
Zookeeper (for setting up a cluster)
Shard – horizontal partition of the index
Replication – number of copies of the index files
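To make sharding and replication concrete, here is a toy sketch. It is not how SOLR routes documents internally: the CRC32-based routing, the naming scheme and both function names are assumptions chosen only to show that each document lands in exactly one shard while each shard's index exists in several copies.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document by hashing its id (illustrative routing only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def cluster_layout(num_shards, replication_factor):
    """List the index copies per shard: shards partition, replicas duplicate."""
    return {
        f"shard{s + 1}": [
            f"shard{s + 1}_replica{r + 1}" for r in range(replication_factor)
        ]
        for s in range(num_shards)
    }


# Hypothetical cluster: 2 shards, each kept in 3 copies (6 index copies in total).
layout = cluster_layout(num_shards=2, replication_factor=3)
print(layout)
print(shard_for("page-1", 2))
```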
15. SOLR things we didn’t see
https://github.com/evolvingweb/ajax-solr
http://wiki.apache.org/solr/SolrQuerySyntax