2. About me:
Ard Schrijvers
1. Working at Hippo since 2001
2. Email: a.schrijvers@onehippo.com
ard@apache.org
3. Worked primarily on:
1. HST
2. Hippo Repository / Jackrabbit
3. Lucene
4. Cocoon
5. Slide
4. Apache committer of Jackrabbit and Cocoon
9. Outline
1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to the rescue
5. A very fast demo
6. Wrap up
7. Questions
11. Current search architecture
So an HSTQuery is translated to an XPath query, which is delegated to the repository. The repository returns a JCR NodeIterator, which the HST binds back to HippoBeans.
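The first step of that chain, turning a query object into a JCR XPath statement, can be sketched roughly as follows. This is a minimal illustration and not the HST's actual translation code: the class XPathSketch, the method toXPath and the example scope path are assumptions made up for this sketch. It only handles a scope, a node type and one free-text term.

```java
class XPathSketch {

    /**
     * Builds a JCR XPath statement that selects nodes of one type below a
     * scope path, optionally filtered by a free-text term.
     * Hypothetical helper; the real HST query translation is more involved.
     */
    static String toXPath(String scopePath, String nodeType, String text) {
        StringBuilder xpath = new StringBuilder("/jcr:root");
        xpath.append(scopePath);
        xpath.append("//element(*, ").append(nodeType).append(")");
        if (text != null && !text.isEmpty()) {
            // Single quotes inside an XPath string literal are escaped by doubling them.
            String escaped = text.replace("'", "''");
            xpath.append("[jcr:contains(., '").append(escaped).append("')]");
        }
        return xpath.toString();
    }

    public static void main(String[] args) {
        // Example scope and node type are illustrative only.
        System.out.println(toXPath("/content/documents", "demosite:textdocument", "hippo"));
    }
}
```

The resulting string is what would be handed to the repository, which executes it against its Lucene index and returns the matching nodes.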
18. Current search architecture
Reasons:
1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents
19. Outline
1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to the rescue
5. A short HOWTO as developer
6. A very fast demo
7. Wrap up
8. Questions
28. Current problems / shortcomings / mismatches
1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries comes at a price in CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex
29. Current problems / shortcomings / mismatches
Extra problem: JCR Nodes != Documents
For example: a news document contains a link to an author document. The news document should be found through the author name.
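One way to picture what "document oriented" indexing means for this example is to flatten the linked author's name into the fields of the news document at index time. The sketch below is a toy illustration under that assumption: the classes News and Author, the field names and the naive matches method are all hypothetical and only stand in for a real index.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class FlattenSketch {

    /** Hypothetical author document. */
    static final class Author {
        final String name;
        Author(String name) { this.name = name; }
    }

    /** Hypothetical news document that links to an author document. */
    static final class News {
        final String title;
        final Author author;
        News(String title, Author author) { this.title = title; this.author = author; }
    }

    /** Flattens the news document, including the linked author's name, into index fields. */
    static Map<String, String> toIndexFields(News news) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", news.title);
        if (news.author != null) {
            fields.put("author", news.author.name);
        }
        return fields;
    }

    /** Naive stand-in for a search: case-insensitive substring match over all field values. */
    static boolean matches(Map<String, String> fields, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : fields.values()) {
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        News news = new News("Solr integration announced", new Author("Ard Schrijvers"));
        System.out.println(matches(toIndexFields(news), "schrijvers"));
    }
}
```

With node-based indexing the author name lives on a different node, so a query on the news node alone cannot see it. Once the name is part of the news document's own fields, the match is trivial.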
31. Objectives
1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)
33. Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides
Easy: Solr integration to the rescue
44. Objective: Easy to use and customize
You decide 'from where', 'what', 'how' and 'when' to index
1. from where: which sources (JCR, web pages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. whether to store the text
4. when: when do you want to index
50. Objective: Easy to use and customize
But of course, out-of-the-box support and tooling ready to be used by YOU
1. Default Hippo repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support
55. Objective: Satisfied customers
If they are not satisfied enough you can:
1. Easily customize it (that is, tune it until you are blue in the face)
2. Hire anyone with Solr experience: all our partners have Solr experience
56. Objective: Satisfied customers
Still not satisfied?
Let them pay too much for a Google Search Appliance, Autonomy or any of the other 'useless to pay for' software
64. Objective: Satisfied partners
1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features
68. Objective: Scalable searches
1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)
79. Objective: Integration with ContentBeans (HippoBeans)
Annotate your getters with
@IndexField
or
@IndexField(name="foo")
And account for them in Solr schema.xml
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="summary" type="text_general" indexed="true" stored="true" />
80. Objective: Integration with ContentBeans (HippoBeans)
An example:
@Node(jcrType="demosite:textdocument")
public class TextBean extends BaseDocument {
    @IndexField
    public String getTitle() {
        return getProperty("demosite:title");
    }

    // Indexed under the explicit field name "samenvatting" (Dutch for "summary")
    @IndexField(name="samenvatting")
    public String getSummary() {
        return getProperty("demosite:summary");
    }
}
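How annotated getters might be turned into index fields can be sketched with plain reflection. This is not the HST's implementation: the IndexField annotation below is a local stand-in with the same shape as the one on the slides, and deriving the default field name from the getter name is an assumption made for this sketch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class IndexFieldSketch {

    /** Local stand-in for the HST annotation; an assumption for illustration only. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IndexField {
        String name() default "";
    }

    /** Plain bean standing in for a ContentBean; it has no JCR node behind it. */
    static class TextBean {
        @IndexField
        public String getTitle() { return "Hello"; }

        @IndexField(name = "samenvatting")
        public String getSummary() { return "A short summary"; }

        // Not annotated, so it is skipped by extractFields.
        public String getIgnored() { return "not indexed"; }
    }

    /** Collects the values of all no-argument public methods annotated with @IndexField. */
    static Map<String, Object> extractFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method method : bean.getClass().getMethods()) {
            IndexField annotation = method.getAnnotation(IndexField.class);
            if (annotation == null || method.getParameterCount() != 0) {
                continue;
            }
            String name = annotation.name().isEmpty()
                    ? defaultName(method.getName())
                    : annotation.name();
            try {
                fields.put(name, method.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + method.getName(), e);
            }
        }
        return fields;
    }

    /** Assumed default: getTitle becomes "title". */
    static String defaultName(String getter) {
        String base = getter.startsWith("get") ? getter.substring(3) : getter;
        if (base.isEmpty()) {
            return getter;
        }
        return Character.toLowerCase(base.charAt(0)) + base.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(extractFields(new TextBean()));
    }
}
```

Each resulting field name still has to exist in the Solr schema, which is why a getter renamed through name="samenvatting" needs a matching field definition there.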
81. Objective: Integration with ContentBeans (HippoBeans)
Another example:
@Node(jcrType="demosite:textdocument")
public class TextBean extends BaseDocument {
    @IndexField
    public String getTitle() {
        return getProperty("demosite:title");
    }

    @IndexField
    public String getSummary() {
        return getProperty("demosite:summary");
    }

    // Index the name of the linked author document as a field of this document
    @IndexField
    public String getAuthor() {
        return getLinkedBean("demosite:author", Author.class).getAuthor();
    }
}
82. Objective: Integration with ContentBeans (HippoBeans)
Another example:
@Node(jcrType="demosite:textdocument")
public class TextBean extends BaseDocument {
    @IndexField
    public String getTitle() {
        return getProperty("demosite:title");
    }

    @IndexField
    public String getSummary() {
        return getProperty("demosite:summary");
    }

    @ReIndexOnChange
    @IndexField
    public Author getAuthor() {
        return getLinkedBean("demosite:author", Author.class);
    }
}
83. Objective: Integration with ContentBeans (HippoBeans)
Another example: Setters
@Node(jcrType="demosite:textdocument")
public class TextBean extends BaseDocument {
    private String title;
    private String summary;

    @IndexField
    public String getTitle() {
        return title == null ? getProperty("demosite:title") : title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    @IndexField
    public String getSummary() {
        return summary == null ? getProperty("demosite:summary") : summary;
    }

    public void setSummary(String summary) {
        this.summary = summary;
    }
}
Bonus: what can we achieve with the setters?
84. Objective: Integration with ContentBeans (HippoBeans)
That's all you need to do
And the HST binds some extra indexing fields like
1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)
88. Objective: Index external sources
You can
1. Push them directly to Solr
2. Push them to an HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST, bind to ContentBeans and commit them to Solr
89. Objective: Index external sources
A ContentBean does *not* need a JCR Node!
ContentBean interface:
public interface ContentBean {
    @IndexField(name="id")
    String getPath();

    void setPath(String path);
}
90. Objective: Index external sources
An example: GoGreenProductBean in the testsuite
public class GoGreenProductBean implements ContentBean {
    private String path;
    private String title;
    private String summary;
    private String description;

    public String getPath() { return path; }
    public void setPath(final String path) { this.path = path; }

    @IndexField
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    @IndexField
    public String getSummary() { return summary; }
    public void setSummary(String summary) { this.summary = summary; }

    @IndexField
    public String getDescription() { return description; }
    public void setDescription(String description) { this.description = description; }
}
91. Objective: Index external sources
Then add the GoGreenProductBeans to Solr:
{
    List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
    // FILL THE gogreenBeans LIST
    // NOW ADD TO INDEX
    HippoSolrManager solrManager =
            HstServices.getComponentManager().getComponent(
                    HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
    try {
        solrManager.getSolrServer().addBeans(gogreenBeans);
        UpdateResponse commit = solrManager.getSolrServer().commit();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }
}
93. Objective: Control the SIZE of the index yourself
JCR / Jackrabbit / Hippo Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized
94. Objective: Control the SIZE of the index yourself
However, search is domain specific. Thus, just index what is needed for the customer
97. Objective: Don't invent but integrate
For example:
HippoSolrManager solrManager = ...
String query = ...
HippoQuery hippoQuery = solrManager.createQuery(query);
hippoQuery.setLimit(pageSize);
hippoQuery.setOffset((page - 1) * pageSize);
// hippoQuery.getSolrQuery() is the SolrQuery object
// include scoring
hippoQuery.getSolrQuery().setIncludeScore(true);
hippoQuery.getSolrQuery().setHighlight(true);
hippoQuery.getSolrQuery().setHighlightFragsize(200);
hippoQuery.getSolrQuery().addHighlightField("title");
hippoQuery.getSolrQuery().addHighlightField("summary");
hippoQuery.getSolrQuery().addHighlightField("htmlContent");
HippoQueryResult result = hippoQuery.execute(true);
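The paging arithmetic behind setLimit and setOffset in the example above can be spelled out in isolation. The helper below is a small sketch with hypothetical names (PagingSketch, offset, totalPages); it assumes pages are numbered from 1, as the (page - 1) * pageSize expression suggests.

```java
class PagingSketch {

    /** Zero-based offset of the first hit on a 1-based page. */
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    /** Number of pages needed to show totalHits results, rounding up. */
    static long totalPages(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize >= 1");
        }
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(offset(3, 10));
        System.out.println(totalPages(95, 10));
    }
}
```

Validating the page number before computing the offset avoids sending a negative start value to the search backend when a request carries page=0.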
102. A very fast demo
Setup: ~75,000 long Wikipedia documents in the repository
(live demo)
111. Wrap up
I think that with the Solr integration
1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier
And finally, last and least:
5. Infra will be happier because the servers stop sweating
113. Questions?
Check out the example at:
http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk