Hippo get together presentation solr integration

Solr integration

April 20, 2012
Ard Schrijvers • a.schrijvers@onehippo.com /
ard@apache.org

About me:
Ard Schrijvers

1. Working at Hippo since 2001
2. Email: a.schrijvers@onehippo.com / ard@apache.org
3. Worked primarily on:
   1. HST
   2. Hippo Repository / Jackrabbit
   3. Lucene
   4. Cocoon
   5. Slide
4. Apache committer of Jackrabbit and Cocoon

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to the rescue
5. A very fast demo
6. Wrap up
7. Questions

Current search architecture

An HSTQuery is translated to an XPath query, which is delegated to the repository. The repository returns a JCR NodeIterator, which the HST binds back to HippoBeans.
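
To make the first step of that chain concrete, here is a minimal sketch of how a simple free-text query might be rendered as a JCR XPath statement. The class `XPathQuerySketch` and its `toXPath` method are hypothetical and are not the HST API; only the general statement shape (`//element(*, type)`, `jcr:contains`, ordering on `@jcr:score`) follows the JCR 1.0 XPath syntax as supported by Jackrabbit.

```java
/** Hypothetical sketch, NOT the HST API: renders a simple free-text query as JCR XPath. */
final class XPathQuerySketch {

    private XPathQuerySketch() {
    }

    /**
     * Builds an XPath statement that searches all nodes of the given primary
     * node type for the given free text, ordered by relevance.
     */
    static String toXPath(String nodeType, String freeText) {
        // Single quotes delimit the string literal, so escape them by doubling.
        String escaped = freeText.replace("'", "''");
        return "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escaped + "')]"
                + " order by @jcr:score descending";
    }
}
```

Calling `XPathQuerySketch.toXPath("demosite:textdocument", "solr")` yields a statement that the repository would still have to parse, map onto its Lucene index and resolve back to nodes, which is where the complexity discussed next comes from.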


That sounds doable and not too complex.

Is it?

Well, it is ... very complex


Reasons:

1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents

Outline

5. A short HOWTO as developer
6. A very fast demo
7. Wrap up
8. Questions

Current problems / shortcomings / mismatches

1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries comes at a price in CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex

Extra problem

JCR Nodes != Documents

For example: a news document contains a link to an author document. Through the author name, the news document should be found.
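
The news / author case can be illustrated with a small, self-contained sketch of the idea behind a 'derived' field: the author name is copied into the index fields of the news document, so a search on the name finds the news item. The `News` and `Author` classes and the field names below are hypothetical and exist only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of denormalizing a linked document into one flat index document. */
final class FlattenSketch {

    /** Hypothetical author document. */
    static final class Author {
        final String name;

        Author(String name) {
            this.name = name;
        }
    }

    /** Hypothetical news document that links to an author document. */
    static final class News {
        final String title;
        final Author author;

        News(String title, Author author) {
            this.title = title;
            this.author = author;
        }
    }

    /** Copies the linked author's name into the news document's own index fields. */
    static Map<String, String> toIndexFields(News news) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", news.title);
        if (news.author != null) {
            // The author lives in a separate node, but is indexed with the news document.
            fields.put("author", news.author.name);
        }
        return fields;
    }
}
```

A node-oriented index has no natural place for the `author` field on the news node, because the name is a property of a different node.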

Objectives

1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)

Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides

Easy:

Solr integration to the rescue

Objective: Easy to use and customize

YOU will be in the driver's seat

No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU

You decide 'from where', 'what', 'how' and 'when' to index

1. from where: which sources (JCR, web pages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. store the text
4. when: when do you want to index

But of course, with out-of-the-box support and tooling ready to be used by YOU:

1. Default Hippo Repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support

Objective: Satisfied customers


HOW?


EASY


Most likely they will simply be satisfied


If they are not satisfied enough you can:

1. Easily customize it (i.e. tune it until you are blue in the face)
2. Hire anyone with Solr experience: all our partners have Solr experience


Still not satisfied?

Let them pay too much for a Google Search appliance,
Autonomy or any of the other 'useless to pay for software'

Objective: Satisfied partners

Although on thin ice here, I strongly believe in this because:


1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features

Objective: Scalable searches

1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)

Objective: Document oriented

What do we want to search for?


Exactly,

Documents!!


A Document == A HippoBean != JCR Node


So let's index HippoBeans (ContentBeans)

Objective: Integration with
ContentBeans (HippoBeans)

As a developer ....

how am I going to index my beans?


I know how to write HippoBeans, that's all I ever did in my life


How do you expect me to index my beans?

Annotate your getters with

@IndexField
or
@IndexField(name="foo")

And account for them in Solr schema.xml
<field name="title" type="text_general" indexed="true" stored="true" />
<field name="summary" type="text_general" indexed="true" stored="true"/>

An example:

@Node(jcrType="demosite:textdocument")
public class TextBean extends BaseDocument {

    @IndexField
    public String getTitle() {
        return getProperty("demosite:title");
    }

    @IndexField(name="samenvatting")
    public String getSummary() {
        return getProperty("demosite:summary");
    }
}
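
To show what such annotations make possible, here is a minimal sketch of how an indexer could collect field names and values from annotated getters through reflection. Everything in it is an assumption made for illustration: the `IndexField` annotation is a local stand-in defined inside the sketch, `TextBean` is a plain class without a repository behind it, and `collectFields` is a hypothetical helper, not the HST implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of how annotated getters could be turned into index fields. */
final class IndexFieldSketch {

    /** Local stand-in for the real annotation, defined here to keep the sketch self-contained. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IndexField {
        String name() default "";
    }

    /** Plain bean standing in for a HippoBean; no repository involved. */
    static final class TextBean {
        @IndexField
        public String getTitle() {
            return "Solr integration";
        }

        @IndexField(name = "samenvatting")
        public String getSummary() {
            return "Indexing beans";
        }

        public String getInternalNote() {
            return "not indexed";
        }
    }

    /** Collects field name -> value for every no-argument getter annotated with @IndexField. */
    static Map<String, Object> collectFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method method : bean.getClass().getMethods()) {
            IndexField annotation = method.getAnnotation(IndexField.class);
            if (annotation == null || method.getParameterCount() != 0) {
                continue;
            }
            String name = annotation.name().isEmpty()
                    ? propertyName(method.getName())
                    : annotation.name();
            try {
                fields.put(name, method.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + method.getName(), e);
            }
        }
        return fields;
    }

    /** "getTitle" becomes "title"; names that do not start with "get" are used as-is. */
    private static String propertyName(String methodName) {
        if (methodName.startsWith("get") && methodName.length() > 3) {
            return Character.toLowerCase(methodName.charAt(3)) + methodName.substring(4);
        }
        return methodName;
    }
}
```

In this sketch the summary ends up under the field name `samenvatting`, so a field with that name, rather than `summary`, would presumably have to be declared in schema.xml.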

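Another example:

public class TextBean extends BaseDocument {

    // getTitle() and getSummary() annotated with @IndexField as in the previous example

    @IndexField
    public String getAuthor() {
        return getLinkedBean("demosite:author", Author.class).getAuthor();
    }
}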

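Another example:

public class TextBean extends BaseDocument {

    // getTitle() and getSummary() annotated with @IndexField as in the previous example

    @ReIndexOnChange
    @IndexField
    public Author getAuthor() {
        return getLinkedBean("demosite:author", Author.class);
    }
}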

Another example: Setters

public class TextBean extends BaseDocument {

    private String title;
    private String summary;

    @IndexField
    public String getTitle() {
        return title == null ? getProperty("demosite:title") : title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    @IndexField
    public String getSummary() {
        return summary == null ? getProperty("demosite:summary") : summary;
    }

    public void setSummary(String summary) {
        this.summary = summary;
    }
}
Bonus : What can we achieve with the Setters?

That's all you need to do

And the HST binds some extra indexing fields like

1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)
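
Two of these extra fields are easy to picture in code. The following sketch shows one way the depth of a path and the class hierarchy (including interfaces) could be derived; `ExtraFieldsSketch` and its methods are hypothetical, and the HST may compute these values differently.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: derives a depth and a class hierarchy field for a bean. */
final class ExtraFieldsSketch {

    private ExtraFieldsSketch() {
    }

    /** Depth of an absolute path: "/" is 0, "/content" is 1, "/content/news" is 2. */
    static int depth(String absolutePath) {
        int depth = 0;
        for (String segment : absolutePath.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    /** Names of the class, its superclasses (excluding Object) and all implemented interfaces. */
    static Set<String> classHierarchy(Class<?> type) {
        Set<String> names = new LinkedHashSet<>();
        collect(type, names);
        return names;
    }

    private static void collect(Class<?> type, Set<String> names) {
        if (type == null || type == Object.class) {
            return;
        }
        names.add(type.getName());
        for (Class<?> implemented : type.getInterfaces()) {
            collect(implemented, names);
        }
        collect(type.getSuperclass(), names);
    }
}
```

Indexing the full hierarchy is what would allow a query to be restricted to "all beans that are a BaseDocument" or "all beans implementing some interface" with a plain field match.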

Objective: Index external sources


You can

1. Push them directly to Solr
2. Push them to an HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST, bind to ContentBeans and commit them to Solr
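
For the first option, pushing directly to Solr, one possibility is Solr's XML update format, in which documents are wrapped in an `<add><doc>` message with one `<field name="...">` element per value. The sketch below only builds such a message as a string; the class `SolrXmlSketch` and the field names are hypothetical, and sending the message to the update handler of your Solr instance (followed by a commit) is left out.

```java
import java.util.Map;

/** Hypothetical sketch: renders one document as a Solr XML add message. */
final class SolrXmlSketch {

    private SolrXmlSketch() {
    }

    /** Builds an add message with one field element per map entry, in iteration order. */
    static String toAddMessage(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    private static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }
}
```

The field names used in the message have to match fields declared in schema.xml, exactly as for the annotated beans.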


A ContentBean does *not* need a JCR Node!

ContentBean interface:

public interface ContentBean {
    @IndexField(name="id")
    String getPath();
    void setPath(String path);
}


An example: GoGreenProductBean in the Testsuite

public class GoGreenProductBean implements ContentBean {

    private String path;
    private String title;
    private String summary;
    private String description;

    public String getPath() { return path; }
    public void setPath(final String path) { this.path = path; }

    @IndexField
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    @IndexField
    public String getSummary() { return summary; }
    public void setSummary(String summary) { this.summary = summary; }

    @IndexField
    public String getDescription() { return description; }
    public void setDescription(String description) { this.description = description; }
}


And add the GoGreenProductBeans to Solr:

{
    List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
    // fill the gogreenBeans list

    // now add the beans to the index and commit
    HippoSolrManager solrManager =
            HstServices.getComponentManager().getComponent(
                    HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
    try {
        solrManager.getSolrServer().addBeans(gogreenBeans);
        UpdateResponse commit = solrManager.getSolrServer().commit();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }
}

Objective: Control the SIZE of the index yourself

JCR / Jackrabbit / Hippo Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized.

However, search is domain specific.

Thus: just index what is needed for the customer.
Objective: Don't invent but integrate


Use Solr

Use the SolrJ client

Expose the SolrJ SolrQuery


For example:
HippoSolrManager solrManager = ...
String query = ...
HippoQuery hippoQuery = solrManager.createQuery(query);
hippoQuery.setLimit(pageSize);
hippoQuery.setOffset((page - 1) * pageSize);

// hippoQuery.getSolrQuery() is the SolrQuery object
// include scoring

hippoQuery.getSolrQuery().setIncludeScore(true);
hippoQuery.getSolrQuery().setHighlight(true);
hippoQuery.getSolrQuery().setHighlightFragsize(200);
hippoQuery.getSolrQuery().addHighlightField("title");
hippoQuery.getSolrQuery().addHighlightField("summary");
hippoQuery.getSolrQuery().addHighlightField("htmlContent");

HippoQueryResult result = hippoQuery.execute(true);
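
The limit and offset in the example above follow the usual one-based paging arithmetic. As a small worked sketch (the `PagingSketch` helper is hypothetical and uses no Solr or HST classes):

```java
/** Hypothetical helper: paging arithmetic only, no Solr classes involved. */
final class PagingSketch {

    private PagingSketch() {
    }

    /** Zero-based offset of the first hit on a one-based page. */
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    /** Number of pages needed to show totalHits results, rounding up. */
    static long pageCount(long totalHits, int pageSize) {
        return (totalHits + pageSize - 1) / pageSize;
    }
}
```

For page 3 with a page size of 10 the offset is 20, so the first 20 hits are skipped and hits 21 to 30 are returned.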

Solr integration to the rescue

No further comments :-)

A very fast demo

Setup: ~75,000 long Wikipedia documents in the repository

............... doing the demo .................

Wrap up

I think that with the Solr integration

1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier

And finally, last and least

5. Infra will be happier because the servers stop sweating

Questions?

Check out the example at :
http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk
