HIGH PERFORMANCE JSON SEARCH AND
RELATIONAL FACETED BROWSING WITH LUCENE

Renaud Delbru
renaud@sindicetech.com
renaud.delbru@deri.org

Co-Founder, SindiceTech
Post-Doctoral Researcher, NUIG
My Background
• Lucene / Solr
– User for 7 years
– Built a web search engine – sindice.com (700M documents)
• Academia & Research
– Ph.D. in Information Retrieval and Semantic Web
– Post-doctoral researcher at National University of Ireland, Galway
• Industry
– Technical co-founder of SindiceTech
– Management Platform for Enterprise Knowledge Graph
Agenda
• Nested Data Model
• SIREn Overview & Theory
• SIREn Plugin Architecture
• Relational Faceted Browsing
• Comparison with BlockJoin
Nested Data Model: Why is it important?
• SQL
– Query-time join: performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships
Denormalising Relational Data

[Figure, two steps. Before: the LucidWorks funding rounds Series A and Series B share a single Granite Ventures record. After: Granite Ventures is duplicated under each of Series A and Series B.]
Nested Data Model: Why is it important?
• SQL
– Query-time join: performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships
– Duplicate data …
– … but avoid joins
Schema-Less Nested Data Model
• Model becoming prevalent: JSON, XML, Avro, …
– Can be arbitrarily nested and large
– No strict schema / structure enforced
• Schema-less brings
– Flexibility
– Ease of development
• Developers do not have to invest significant modelling effort upfront
Introducing SIREn
• Lucene/Solr plugin for indexing and searching JSON
• Rich data model (JSON)
– Nested objects, nested arrays, datatypes
• Schema-agnostic
– No need to define structure (nested model)
– No need to define schema (fields)
Overview of the SIREn API
Document

{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        },
        …
      ]
    },
    …
  ]
}

Query

(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})
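One way to read the query above is as a set of constraints that must all hold inside the same funding round. The sketch below spells that reading out in plain Python over the parsed document; it is not SIREn code, the function names are made up for illustration, and it uses exact string comparison where SIREn would apply datatype analyzers.

```python
def has_financial_org(funding_round):
    """'* : { type : financial-org }': any field of the round holds such an object."""
    for value in funding_round.values():
        items = value if isinstance(value, list) else [value]
        if any(isinstance(i, dict) and i.get("type") == "financial-org" for i in items):
            return True
    return False


def matches(doc):
    """Hypothetical evaluation of the slide's query against one JSON document."""
    if doc.get("category_code") != "analytics":
        return False
    # All nested constraints must be satisfied by one and the same round.
    return any(
        isinstance(r, dict)
        and r.get("round_code") in {"seed", "a", "angel"}
        and isinstance(r.get("raised_amount"), int)
        and 0 <= r["raised_amount"] <= 12_000_000
        and has_financial_org(r)
        for r in doc.get("funding_rounds", [])
    )


# The document from the slide, with the elided parts left out.
company = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [
        {
            "round_code": "a",
            "raised_amount": 6000000,
            "funded_year": 2009,
            "investments": [{"name": "Granite Ventures", "type": "financial-org"}],
        }
    ],
}

print(matches(company))
```

The point of the nested form is that a document with one round matching `round_code` and a different round matching `raised_amount` should not match, which a flat field model cannot express.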
Theory behind SIREn
• Inspired by tree-labelling scheme techniques (XML IR)
– Label each node with a hierarchical identifier (here, Dewey identifiers)
• Full-text search operators over the content of a node
• Structural search operators over the nodes of the tree
– Ancestor-Descendant, Parent-Child, Sibling, …
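With Dewey identifiers, the structural operators listed above reduce to prefix comparisons on the identifiers. A minimal sketch, assuming identifiers are written as dot-separated integers; the helper names are illustrative and not part of SIREn:

```python
def parse(label):
    """Turn a Dewey label such as '1.2.2.1' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, d):
    """a is a proper ancestor of d when a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p, c):
    """p is the parent of c when c is p plus exactly one more component."""
    return len(c) == len(p) + 1 and c[:-1] == p


def are_siblings(x, y):
    """Two distinct nodes are siblings when they share the same parent prefix."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]


print(is_ancestor(parse("1.2"), parse("1.2.2.1")))
print(is_parent(parse("1.2"), parse("1.2.2.1")))
```

No tree has to be materialised at query time: each check only looks at the two identifiers involved.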
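Theory behind SIREn: Tree-Labelling

{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      …
    },
    …
  ]
}

[Figure: the document drawn as a tree, each node labelled with its Dewey identifier]
1            document root
1.1          name
1.1.1        LucidWorks
1.2          funding_rounds
1.2.1        (intermediate node, not named in the figure)
1.2.2        (intermediate node, not named in the figure)
1.2.2.1      round_code
1.2.2.1.1    a
1.2.2.2      raised_amount
1.2.2.2.1    6000000
…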
Theory behind SIREn: Query Processing

Query
name : LucidWorks → ?

Inverted Index
name         1.1, 2.2, 2.5
LucidWorks   1.5.3, 2.2.1, 4.2.1
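The example above can be read as a parent-child join between the two posting lists: a match is a `LucidWorks` node whose parent is a `name` node. The sketch below is a simplification that uses a set lookup for brevity, whereas a real implementation would presumably traverse the sorted posting lists in step; the function name is made up.

```python
# Posting lists from the slide, as tuples of Dewey components.
NAME = [(1, 1), (2, 2), (2, 5)]
LUCIDWORKS = [(1, 5, 3), (2, 2, 1), (4, 2, 1)]


def parent_child_join(parents, children):
    """Return (parent, child) pairs where the child's prefix is a parent posting."""
    parent_set = set(parents)
    return [(child[:-1], child) for child in children if child[:-1] in parent_set]


matches = parent_child_join(NAME, LUCIDWORKS)
print(matches)
```

Only the pair 2.2 / 2.2.1 survives the join: 1.5.3 and 4.2.1 have no `name` posting as their parent.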
SIREn Plugin Architecture - Overview

[Architecture diagram with three layers. Analysis: Document, JSON Analyzer. Query: Flexible Query Parser, JSON Query Parser, Node Query. Codec: Tree-Labelling Codec. The legend distinguishes Lucene components from SIREn components.]
JSON Field

schema.xml sample

<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="json" type="json" indexed="true" stored="false"/>
  …
</fields>
<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
  …
</types>
Datatypes

datatypes.xml sample

<datatype name="http://www.w3.org/2001/XMLSchema#String"
          class="org.sindice.siren.solr.schema.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</datatype>
<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>
JSON Tokenizer
• Traverses JSON tree using Depth-First Search
• Generates one token per JSON node
• Attaches metadata attributes (Dewey id, datatype, …) to each token

Tokenizer Output
token            Dewey id   datatype
name             1.1        Field
LucidWorks       1.1.1      String
funding_rounds   1.2        Field
round_code       1.2.2.1    String
…
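The three steps above can be sketched as a small recursive generator. This is an illustration only: the numbering rule (field names take the next child index, a scalar value becomes child 1 of its field, array items are numbered under the field) is an assumption chosen for simplicity and does not necessarily reproduce the exact identifiers SIREn assigns, and the datatype reported for values is just the Python type name.

```python
def dewey(path):
    """Render a tuple of components as a dotted Dewey identifier."""
    return ".".join(str(component) for component in path)


def tokenize(value, path=(1,)):
    """Yield (text, dewey_id, datatype) tuples in depth-first order."""
    if isinstance(value, dict):
        for i, (key, child) in enumerate(value.items(), start=1):
            field_path = path + (i,)
            yield key, dewey(field_path), "Field"
            yield from tokenize(child, field_path)
    elif isinstance(value, list):
        for i, child in enumerate(value, start=1):
            yield from tokenize(child, path + (i,))
    else:
        # Scalar: one token carrying the value itself (placeholder datatype).
        yield str(value), dewey(path + (1,)), type(value).__name__


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
tokens = list(tokenize(doc))
for token in tokens:
    print(token)
```

Because the traversal is depth-first, tokens come out in document order, so the identifiers within one document are produced in sorted order.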
JSON Analyzer – NodeTokenizerFilter
• Tokenize the content of a node token based on its datatype

Input
token            Dewey id   datatype
name             1.1        Field
funding_rounds   1.2        Field
LucidWorks       1.1.1      String
round_code       1.2.2.1    String
…

Output
input token      output tokens     analyzer
name             name              Field datatype analyzer
funding_rounds   funding, rounds   Field datatype analyzer
LucidWorks       lucid, works      String datatype analyzer
…
JSON Analyzer – NodePayloadFilter
• Encode metadata attributes into a term payload
• Leverage Payload API to transfer attributes to the Codec API
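As an illustration of what "encoding metadata attributes into a payload" can look like, the sketch below packs a node path and a datatype id into a byte string using variable-length integers and reads them back. The byte layout is a made-up example, not SIREn's actual payload format, and `datatype_id` is a hypothetical numeric code for the datatype.

```python
def write_vint(n, out):
    """Append n to out using 7 bits per byte, high bit set on all but the last byte."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_vint(buf, pos):
    """Read one variable-length integer from buf at pos; return (value, next_pos)."""
    n = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        n |= (byte & 0x7F) << shift
        if byte < 0x80:
            return n, pos
        shift += 7


def encode_payload(node_path, datatype_id):
    """Assumed layout: datatype id, path length, then each path component."""
    out = bytearray()
    write_vint(datatype_id, out)
    write_vint(len(node_path), out)
    for component in node_path:
        write_vint(component, out)
    return bytes(out)


def decode_payload(buf):
    datatype_id, pos = read_vint(buf, 0)
    length, pos = read_vint(buf, pos)
    path = []
    for _ in range(length):
        component, pos = read_vint(buf, pos)
        path.append(component)
    return path, datatype_id


payload = encode_payload([1, 2, 2, 1], 3)
print(payload.hex(), decode_payload(payload))
```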
Tree-Labelling Codec – File Structure

[Figure: the three posting files and the contents of one block in each]
file    block contents
.doc    Header, Doc identifiers, Node frequencies
.nod    Header, Node identifiers, Term frequencies
.pos    Header, Term positions
Tree-Labelling Codec – Compression
• Adaptive Frame Of Reference
– Adapt the encoding to the integer distribution
– Better tolerance against outliers
– Very effective with frequencies, node identifiers and positions (higher compression rate)

[Figure: a FOR-encoded block with a single BFS next to an AFOR-encoded block with four BFS.]
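The outlier argument can be made concrete with a little arithmetic. The sketch below compares the bit cost of one bit width for a whole block against one bit width per frame. It is a simplified model: frames have a fixed size here, whereas an adaptive scheme would also choose the frame sizes, and the 8-bit per-frame header is an assumed overhead, not a figure from the slides.

```python
HEADER_BITS = 8  # assumed cost of recording one frame's bit width


def bit_width(values):
    """Bits needed for the largest value in the frame (at least 1)."""
    return max(1, max(values).bit_length())


def for_bits(values):
    """Plain frame of reference: one bit width for the whole block."""
    return HEADER_BITS + len(values) * bit_width(values)


def afor_bits(values, frame_size):
    """Per-frame bit widths, so an outlier only inflates its own frame."""
    total = 0
    for start in range(0, len(values), frame_size):
        frame = values[start:start + frame_size]
        total += HEADER_BITS + len(frame) * bit_width(frame)
    return total


# Small integers with a single outlier (900) in the second half.
block = [1, 2, 1, 3, 2, 1, 2, 1, 1, 1, 2, 900, 1, 2, 1, 1]
print(for_bits(block), afor_bits(block, 8))
```

With one width for the block, the outlier forces 10 bits on all 16 values; with two frames of 8, only the frame containing the outlier pays for it.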
Node Query
• Query Processing
– Collects matching document and node identifiers
– Posting list traversal order: document ids, node ids then positions
• Adaptation of all Lucene’s Query classes to the new file structure
– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
• TwigQuery
– Consists of a root query and one or more descendant or child queries
– Can be nested to form complex tree structure
– Can be rewritten as a pure boolean query

[Figure: example twig. Query nodes: Boolean, Phrase, Twig, Boolean, Range. Edge labels: MUST, NOT, SHOULD, SHOULD.]
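To make the twig idea concrete, here is a toy evaluator over an in-memory index that maps terms to Dewey identifiers. It is only a sketch of the concept: the `Term` and `Twig` classes are hypothetical and unrelated to SIREn's Java classes, only child (not descendant) clauses are handled, and SHOULD clauses are left out because they would affect scoring rather than matching.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str

    def nodes(self, index):
        """Node identifiers at which this term occurs."""
        return set(index.get(self.text, ()))


@dataclass
class Twig:
    root: Term
    children: list = field(default_factory=list)  # (occur, query), occur is MUST or NOT

    def nodes(self, index):
        """Root nodes whose child clauses are all satisfied."""
        result = set()
        for node in self.root.nodes(index):
            ok = True
            for occur, child in self.children:
                hit = any(c[:-1] == node for c in child.nodes(index))
                if (occur == "MUST" and not hit) or (occur == "NOT" and hit):
                    ok = False
                    break
            if ok:
                result.add(node)
        return result


# Toy index with made-up identifiers.
index = {
    "funding_rounds": [(1, 2)],
    "round_code": [(1, 2, 1)],
    "a": [(1, 2, 1, 1)],
}

round_a = Twig(Term("round_code"), [("MUST", Term("a"))])
nested = Twig(Term("funding_rounds"), [("MUST", round_a)])
print(round_a.nodes(index), nested.nodes(index))
```

Because a twig returns the identifiers of its matching root nodes, it can itself be used as a child clause of another twig, which is how nesting falls out of the design.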
Application: Relational Faceted Navigation
• Faceted Navigation
– Data-driven exploratory interface
– User incrementally adds constraints
– Restricted to one record collection
• Relational Faceted Navigation
– Enables navigation of interrelated record collections
– Constraints affect all record collections
– New navigation operation: Pivot
  • Switch user view to a record collection
Relational Faceted Navigation – Demo

HCLS Demo: http://hcls.sindice.com/pivot-browser/
Data Model
• Each collection has its own data model (document)
• Lucene fields for facets
• JSON field for relationships with records from other collections

[Figure: three collections (Company, Investment, Investor), each with Lucene facet fields (Country, Year, Type, Category, Amount) and one JSON field.]
JSON Model
• JSON field: Tree covering all the relationships with records from other collections
• Resulting tree can be very large

Company view
Company
  category_code
  country_code
  funding_rounds […]  (Investment)
    round_code
    raised_amount
    investments […]  (Investor)
      type

Investment view
Investment
  round_code
  raised_amount
  funding_rounds -1 […]  (Company)
    category_code
    country_code
  investments […]  (Investor)
    type

Investor view
Investor
  type
  investments -1 […]  (Investment)
    round_code
    raised_amount
    funding_rounds -1 […]  (Company)
      category_code
      country_code
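A rough sketch of how such per-collection trees could be produced from relational records. The record layout, the identifiers `c1` and `v1`, and the `country_code` value are invented for the example; the key `funding_rounds -1` is written as it appears in the slides, where `-1` appears to mark the inverse of a relationship.

```python
# Illustrative relational records (country_code is a made-up value).
companies = {"c1": {"name": "LucidWorks", "category_code": "analytics", "country_code": "usa"}}
investors = {"v1": {"name": "Granite Ventures", "type": "financial-org"}}
investments = [
    {"company": "c1", "investor": "v1", "round_code": "a", "raised_amount": 6000000},
]


def company_facets(company_id):
    company = companies[company_id]
    return {"category_code": company["category_code"], "country_code": company["country_code"]}


def company_view(company_id):
    """JSON tree of a Company record: rounds nested under it, investors under rounds."""
    rounds = [
        {
            "round_code": inv["round_code"],
            "raised_amount": inv["raised_amount"],
            "investments": [{"type": investors[inv["investor"]]["type"]}],
        }
        for inv in investments
        if inv["company"] == company_id
    ]
    return {**company_facets(company_id), "funding_rounds": rounds}


def investment_view(inv):
    """JSON tree of an Investment record: inverse link to Company, forward link to Investor."""
    return {
        "round_code": inv["round_code"],
        "raised_amount": inv["raised_amount"],
        "funding_rounds -1": [company_facets(inv["company"])],
        "investments": [{"type": investors[inv["investor"]]["type"]}],
    }


print(company_view("c1"))
print(investment_view(investments[0]))
```

The same company and investor attributes are copied into every view that reaches them, which is the duplication the denormalisation slides refer to.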
Navigation Model: Drill-Down

Lucene query

collection : Company
AND
country_code : irl
AND
category_code : software
Navigation Model: Pivot

Query Rewriting

Preceding Lucene query

collection : Company
AND
country_code : irl
AND
category_code : software

Lucene query

collection : Investment

JSON query

funding_rounds -1 : {
  country_code : irl,
  category_code : software
}

Navigation Model: Pivot

Lucene query

collection : Investor

JSON query

investments -1 : {
  founded_year : 2012,
  funding_rounds -1 : {
    country_code : irl,
    category_code : software
  }
}
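The rewriting step shown in these slides can be sketched as a small function: on a pivot, the constraints accumulated on the current collection are wrapped under the relationship field that links the target collection back to it. The state layout and the `RELATION` table are assumptions made for this example (derived from the JSON model slides), and the JSON query is represented as a plain dictionary rather than SIREn's query syntax.

```python
# (target collection, current collection) -> field in the target's JSON tree
RELATION = {
    ("Investment", "Company"): "funding_rounds -1",
    ("Investor", "Investment"): "investments -1",
    ("Company", "Investment"): "funding_rounds",
    ("Investment", "Investor"): "investments",
}


def pivot(state, target):
    """Switch the view to `target`, nesting the current constraints under the relation."""
    relation_field = RELATION[(target, state["collection"])]
    nested = dict(state["facets"])   # facet constraints on the current collection
    nested.update(state["json"])     # constraints inherited from earlier pivots
    return {"collection": target, "facets": {}, "json": {relation_field: nested}}


start = {
    "collection": "Company",
    "facets": {"country_code": "irl", "category_code": "software"},
    "json": {},
}
on_investment = pivot(start, "Investment")
on_investment["facets"]["founded_year"] = 2012   # drill-down after the first pivot
on_investor = pivot(on_investment, "Investor")
print(on_investor["json"])
```

Each pivot adds one level of nesting, so the depth of the JSON query grows with the number of pivots the user has performed.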
Comparison with BlockJoin
• Lucene BlockJoin
– Introduced support for indexing and searching nested data …
– … for small and well-defined schema
Lucene BlockJoin - Scalability
• Artificially increases the number of documents in the index
– One document per nested data record
• Cache size linear with the number of nested data records
– Increased memory usage
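To get a feel for the first point, the sketch below counts how many index documents a single JSON record would turn into if every nested object became its own document, which is how the slide characterises BlockJoin. Treating each JSON object as one "nested data record" is an assumption of this example.

```python
def count_records(value):
    """Count JSON objects in a value; each would be one index document."""
    if isinstance(value, dict):
        return 1 + sum(count_records(child) for child in value.values())
    if isinstance(value, list):
        return sum(count_records(child) for child in value)
    return 0


company = {
    "name": "LucidWorks",
    "funding_rounds": [
        {"round_code": "a", "investments": [{"type": "financial-org"}]},
        {"round_code": "b", "investments": [{"type": "financial-org"}]},
    ],
}
print(count_records(company))
```

Here one company with two rounds and one investor per round already yields five documents instead of one, and the count grows with every level of nesting.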
Lucene BlockJoin - Flexibility
• Developers must be aware of the relations between nested data records
– At indexing time to tag parent records
– At querying time to filter parent records
• Upfront effort required to design and configure the system
– Define Parent-Child relationships between record collections
– Define attributes for each record collection
• If not properly designed, risk of incorrect matches
Comparison with BlockJoin
• BlockJoin
+ Works out of the box with all Lucene’s features
‒ Requires upfront design effort
‒ Memory usage dependent on nested data structure
• Tree-Labelling
+ Can handle arbitrary and large nested models
+ Memory friendly
‒ Lucene’s features have to be re-thought and re-implemented
Conclusion
• Nested data models are becoming more and more prevalent
• Searching nested data brings new challenges: performance, scalability, flexibility
• Different approaches exist, each one with pros and cons
• SIREn plugin based on tree-labelling techniques
• Enables new kinds of search applications, e.g., relational faceted browser, with subsecond response time
• SIREn Availability
– Trial license currently available
– In negotiation with the University to open-source
Acknowledgement
This material is based upon works supported by the European FP7 project LOD2
(257943) and the Irish Research Council for Science, Engineering and Technology.

 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
MongoDB Basics
MongoDB BasicsMongoDB Basics
MongoDB Basics
 
Graph Databases in the Microsoft Ecosystem
Graph Databases in the Microsoft EcosystemGraph Databases in the Microsoft Ecosystem
Graph Databases in the Microsoft Ecosystem
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platformlucenerevolution
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucenelucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profitlucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
The Typed Index
The Typed IndexThe Typed Index
The Typed Index
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profit
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

High Performance JSON Search and Faceted Browsing with Lucene

  • 2. HIGH PERFORMANCE JSON SEARCH AND RELATIONAL FACETED BROWSING WITH LUCENE
      Renaud Delbru (renaud@sindicetech.com, renaud.delbru@deri.org)
      Co-Founder, SindiceTech; Post-Doctoral Researcher, NUIG
  • 3. My Background
      • Lucene / Solr
        – User for 7 years
        – Built a web search engine, sindice.com (700M documents)
      • Academia & Research
        – Ph.D. in Information Retrieval and Semantic Web
        – Post-doctoral researcher at National University of Ireland, Galway
      • Industry
        – Technical co-founder of SindiceTech
        – Management Platform for Enterprise Knowledge Graph
  • 4. Agenda
      • Nested Data Model
      • SIREn Overview & Theory
      • SIREn Plugin Architecture
      • Relational Faceted Browsing
      • Comparison with BlockJoin
  • 5. Nested Data Model: Why is it important?
      • SQL
        – Query-time join performance penalty
      • NoSQL
        – Denormalisation of relational data into nested data
        – Convert many-to-one/many into one-to-many relationships
  • 6. Denormalising Relational Data
      [Figure: LucidWorks, Series A, Series B, Granite Ventures (one node)]
  • 7. Denormalising Relational Data
      [Figure: LucidWorks, Series A, Series B, Granite Ventures (duplicated, two nodes)]
  • 8. Nested Data Model: Why is it important?
      • SQL
        – Query-time join performance penalty
      • NoSQL
        – Denormalisation of relational data into nested data
        – Convert many-to-one/many into one-to-many relationships
        – Duplicate data …
        – … but avoid joins
  • 9. Schema-Less Nested Data Model
      • Model becoming prevalent: JSON, XML, Avro, …
        – Can be arbitrarily nested and large
        – No strict schema / structure enforced
      • Schema-less brings
        – Flexibility
        – Ease of development
      • Developers do not have to invest significant modelling effort upfront
  • 10. Introducing SIREn
      • Lucene/Solr plugin for indexing and searching JSON
      • Rich data model (JSON)
        – Nested objects, nested arrays, datatypes
      • Schema-agnostic
        – No need to define structure (nested model)
        – No need to define schema (fields)
  • 11. Overview of the SIREn API
      Document:
        {
          "name" : "LucidWorks",
          "category_code" : "analytics",
          "funding_rounds" : [
            {
              "round_code" : "a",
              "raised_amount" : 6000000,
              "funded_year" : 2009,
              "investments" : [
                { "name" : "Granite Ventures", "type" : "financial-org" },
                …
              ]
            },
            …
          ]
        }
      Query:
        (category_code : analytics) AND
        (funding_rounds : {
          round_code : seed OR a OR angel,
          raised_amount : [0 TO 12000000],
          * : { type : financial-org }
        })
  • 12. Theory behind SIREn
      • Inspired by tree-labelling scheme techniques (XML IR)
        – Label each node with a hierarchical identifier (here, Dewey identifiers)
      • Full-text search operators over the content of a node
      • Structural search operators over the nodes of the tree
        – Ancestor-Descendant, Parent-Child, Sibling, …
  • 13. Theory behind SIREn: Tree-Labelling
      {
        "name" : "LucidWorks",
        "category_code" : "analytics",
        "funding_rounds" : [
          { "round_code" : "a", "raised_amount" : 6000000, "funded_year" : 2009, … },
          …
        ]
      }
      [Figure: tree with nodes name, LucidWorks, funding_rounds, round_code, a, raised_amount, 6000000, …]
  • 14. Theory behind SIREn: Tree-Labelling
      [Figure: the same tree with Dewey labels: root 1; name 1.1; LucidWorks 1.1.1; funding_rounds 1.2; intermediate nodes 1.2.1 and 1.2.2; round_code 1.2.2.1; a 1.2.2.1.1; raised_amount 1.2.2.2; 6000000 1.2.2.2.1; …]
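The tree-labelling idea above can be sketched in a few lines of Python. This is a minimal illustration, not SIREn's implementation: the function name `label_nodes` and the numbering convention (every field name is a node, its values are its children) are assumptions, so the labels it produces do not line up one-to-one with the figure, which omits `category_code`.

```python
def dewey(path):
    """Render a tuple such as (1, 3, 1) as the Dewey string '1.3.1'."""
    return ".".join(str(part) for part in path)


def label_nodes(value, path=(1,)):
    """Yield (dewey_id, text) pairs for every node, in depth-first order."""
    if isinstance(value, dict):
        yield dewey(path), "{}"
        for i, (key, child) in enumerate(value.items(), start=1):
            # The field name is a node; its value(s) become its children.
            yield dewey(path + (i,)), key
            items = child if isinstance(child, list) else [child]
            for j, item in enumerate(items, start=1):
                yield from label_nodes(item, path + (i, j))
    elif isinstance(value, list):
        yield dewey(path), "[]"
        for j, item in enumerate(value, start=1):
            yield from label_nodes(item, path + (j,))
    else:
        yield dewey(path), str(value)


doc = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
labels = dict(label_nodes(doc))
for dewey_id, text in labels.items():
    print(dewey_id, text)
```

Because a label encodes the full path from the root, ancestor and parent tests reduce to prefix comparisons on the labels.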
  • 15. Theory behind SIREn: Query Processing
      Query: name ? LucidWorks
      Inverted index:
        name → 1.1, 2.2, 2.5
        LucidWorks → 1.5.3, 2.2.1, 4.2.1
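A parent-child match over the two posting lists on this slide can be sketched as follows. The sketch assumes that a value node matches a field node when the value's label, minus its last component, equals the field's label. It uses a set lookup for brevity, whereas the slides describe an ordered traversal of posting lists; the names `parse` and `parent_child_matches` are hypothetical.

```python
def parse(label):
    """Turn '2.2.1' into the tuple (2, 2, 1)."""
    return tuple(int(part) for part in label.split("."))


def parent_child_matches(parent_postings, child_postings):
    """Return (parent, child) label pairs where child is a direct child of parent."""
    parents = {parse(p) for p in parent_postings}
    matches = []
    for label in child_postings:
        child = parse(label)
        if child[:-1] in parents:  # dropping the last component gives the parent label
            matches.append((".".join(map(str, child[:-1])), label))
    return matches


# Posting lists as shown on the slide.
inverted_index = {
    "name": ["1.1", "2.2", "2.5"],
    "LucidWorks": ["1.5.3", "2.2.1", "4.2.1"],
}
result = parent_child_matches(inverted_index["name"], inverted_index["LucidWorks"])
print(result)
```

With the slide's postings, only the pair 2.2 / 2.2.1 satisfies the parent-child condition.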
  • 22. SIREn Plugin Architecture - Overview
      [Diagram components: Document; Analysis; JSON Analyzer; Query; Flexible Query Parser; JSON Query Parser; Node Query; Codec; Tree-Labelling Codec. Legend: Lucene, SIREn]
  • 23. JSON Field
      schema.xml sample:
        <fields>
          <field name="id" type="string" indexed="true" stored="true"/>
          <field name="json" type="json" indexed="true" stored="false"/>
          …
        </fields>
        <types>
          <fieldType name="json" class="org.sindice.siren.solr.schema.JsonField" datatypeConfig="datatypes.xml"/>
          …
        </types>
  • 24. Datatypes
      datatypes.xml sample:
        <datatype name="http://www.w3.org/2001/XMLSchema#String" class="org.sindice.siren.solr.schema.TextDatatype">
          <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
          </analyzer>
        </datatype>
        <datatype name="http://www.w3.org/2001/XMLSchema#int" class="org.sindice.siren.solr.schema.TrieDatatype" precisionStep="8" type="integer"/>
  • 25. JSON Tokenizer
      • Traverses JSON tree using Depth-First Search
      • Generates one token per JSON node
      • Attaches metadata attributes (Dewey id, datatype, …) to each token
      Tokenizer output (token, Dewey id, datatype):
        name, 1.1, Field
        LucidWorks, 1.1.1, String
        funding_rounds, 1.2, Field
        round_code, 1.2.2.1, String
        …
  • 26. JSON Analyzer – NodeTokenizerFilter
      • Tokenizes the content of a node token based on its datatype
      Input (token, Dewey id, datatype): name, 1.1, Field; LucidWorks, 1.1.1, String; funding_rounds, 1.2, Field; round_code, 1.2.2.1, String; …
      Output: name → name; LucidWorks → lucid, works; funding_rounds → funding, rounds; …
  • 27. JSON Analyzer – NodeTokenizerFilter
      Same input and output; LucidWorks → lucid, works is tokenized with the String datatype analyzer
  • 28. JSON Analyzer – NodeTokenizerFilter
      Same input and output; funding_rounds → funding, rounds is tokenized with the Field datatype analyzer
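The datatype-driven tokenization on these slides can be illustrated with a small dispatch table. This is only a sketch: the two analyzer functions are hypothetical stand-ins (in SIREn the analyzers are configured per datatype in datatypes.xml), chosen so that the output reproduces the tokens shown on the slide.

```python
import re


def analyze_string(text):
    """Hypothetical String analyzer: split on case changes, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)]


def analyze_field(text):
    """Hypothetical Field analyzer: split field names on underscores."""
    return [t for t in text.lower().split("_") if t]


ANALYZERS = {"String": analyze_string, "Field": analyze_field}


def node_tokenizer_filter(node_tokens):
    """Re-tokenize each node token with the analyzer registered for its datatype.

    Every emitted term keeps the Dewey id and datatype of the node it came from.
    """
    for text, dewey_id, datatype in node_tokens:
        for term in ANALYZERS[datatype](text):
            yield term, dewey_id, datatype


node_tokens = [
    ("name", "1.1", "Field"),
    ("LucidWorks", "1.1.1", "String"),
    ("funding_rounds", "1.2", "Field"),
]
terms = list(node_tokenizer_filter(node_tokens))
print(terms)
```

The point of the design is that a single JSON field can mix datatypes, so the analyzer is selected per node rather than per Lucene field.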
  • 29. JSON Analyzer – NodePayloadFilter
      • Encodes metadata attributes into a term payload
      • Leverages the Payload API to transfer attributes to the Codec API
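To make the payload idea concrete, here is a sketch of how a Dewey identifier could be packed into bytes. The slides do not describe SIREn's actual payload layout, so the format below (a component count followed by the components, each as a variable-length integer with 7 data bits per byte) is purely an assumption, as are the function names.

```python
def encode_varint(n):
    """Encode a non-negative int with 7 data bits per byte, low-order bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_dewey(path):
    """Hypothetical payload layout: component count, then each component."""
    return b"".join(encode_varint(n) for n in (len(path), *path))


def decode_dewey(payload):
    """Inverse of encode_dewey; raises ValueError on a malformed payload."""
    values, current, shift = [], 0, 0
    for byte in payload:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if not values or values[0] != len(values) - 1:
        raise ValueError("malformed Dewey payload")
    return tuple(values[1:])


payload = encode_dewey((1, 2, 2, 1))
print(payload.hex(), decode_dewey(payload))
```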
  • 31. Tree-Labelling Codec – File Structure
      Each file is organised in blocks:
        .doc: Header | Doc identifiers | Node frequencies
        .nod: Header | Node identifiers | Term frequencies
        .pos: Header | Term positions
  • 32. Tree-Labelling Codec – Compression
      • Adaptive Frame Of Reference
        – Adapts the encoding to the integer distribution
        – Better tolerance against outliers
        – Very effective with frequencies, node identifiers and positions (higher compression rate)
      [Figure: FOR encodes a block with a single BFS; AFOR splits it into several frames, each with its own BFS]
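The benefit of adapting the frame to the data can be shown with a simple cost model. The sketch below is not the codec's implementation: it assumes frames of 8, 16 or 32 integers, a one-byte header per frame, and a dynamic-programming search for the cheapest partition, none of which is stated on the slide.

```python
def bit_width(values):
    """Bits needed for the largest value in the frame (at least 1)."""
    return max(max(values).bit_length(), 1)


def frame_cost(values, header_bits=8):
    """Encoded size in bits: assumed one-byte header plus fixed-width values."""
    return header_bits + len(values) * bit_width(values)


def for_cost(values, block=32):
    """Plain Frame Of Reference: one bit width per fixed-size block."""
    return sum(frame_cost(values[i:i + block]) for i in range(0, len(values), block))


def afor_cost(values, frame_sizes=(8, 16, 32)):
    """Adaptive variant: pick the cheapest split into frames of the allowed sizes."""
    best = [0] + [None] * len(values)  # best[i] = fewest bits to encode values[:i]
    for end in range(1, len(values) + 1):
        options = [
            best[end - size] + frame_cost(values[end - size:end])
            for size in frame_sizes
            if end >= size and best[end - size] is not None
        ]
        best[end] = min(options) if options else None
    return best[len(values)]


# 31 small values and one outlier: the outlier forces a wide frame for the whole block in FOR.
block = [1] * 31 + [1000]
print(for_cost(block), afor_cost(block))
```

In this example the outlier costs FOR 10 bits for each of the 32 values, while the adaptive split confines the wide encoding to the last frame of 8.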
  • 34. Node Query
      • Query Processing
        – Collects matching document and node identifiers
        – Posting list traversal order: document ids, node ids, then positions
      • Adaptation of all Lucene’s Query classes to the new file structure
        – NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
  • 35. Node Query
      • Query Processing
        – Collects matching document and node identifiers
        – Posting list traversal order: document ids, node ids, then positions
      • Adaptation of all Lucene’s Query classes to the new file structure
        – NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
      [Figure: query tree with Boolean and Phrase nodes; edges labelled MUST and SHOULD]
  • 37. Node Query
      • Query Processing
        – Collects matching document and node identifiers
        – Posting list traversal order: document ids, node ids, then positions
      • Adaptation of all Lucene’s Query classes to the new file structure
        – NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
        – Can be nested to form a complex tree structure
      [Figure: query tree with Boolean, Phrase, Twig and Range nodes; edges labelled MUST, NOT and SHOULD]
  • 38. Node Query
      • Query Processing
        – Collects matching document and node identifiers
        – Posting list traversal order: document ids, node ids, then positions
      • Adaptation of all Lucene’s Query classes to the new file structure
        – NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
        – Can be nested to form a complex tree structure
        – Can be rewritten as a pure boolean query
      [Figure: query tree with Boolean, Phrase, Twig and Range nodes; edges labelled MUST, NOT and SHOULD]
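The twig idea (a root query plus constraints on nodes below it) can be sketched over Dewey labels. This is a simplified illustration under several assumptions: the function `twig_match` is hypothetical, it only supports MUST and MUST_NOT clauses on descendants, it does not distinguish child from descendant, and the toy index maps terms to labels whose first component is read as the document.

```python
MUST, MUST_NOT = "MUST", "MUST_NOT"


def is_descendant(node, ancestor):
    """A node is a descendant when the ancestor's label is a proper prefix of its own."""
    return len(node) > len(ancestor) and node[:len(ancestor)] == ancestor


def twig_match(index, root_term, clauses):
    """Return root nodes whose descendants satisfy every (occur, term) clause."""
    matches = []
    for root in sorted(index.get(root_term, ())):
        keep = True
        for occur, term in clauses:
            found = any(is_descendant(node, root) for node in index.get(term, ()))
            if (occur == MUST and not found) or (occur == MUST_NOT and found):
                keep = False
                break
        if keep:
            matches.append(root)
    return matches


# Toy index: term -> set of Dewey labels (illustrative values, not from the slides).
index = {
    "funding_rounds": {(1, 3), (2, 3)},
    "a": {(1, 3, 1, 1, 1)},
    "seed": {(2, 3, 1, 1, 1)},
    "financial-org": {(1, 3, 1, 3, 1, 2, 1), (2, 3, 1, 3, 1, 2, 1)},
}
roots = twig_match(index, "funding_rounds", [(MUST, "financial-org"), (MUST_NOT, "seed")])
print(roots)
```

Because each clause is evaluated as a boolean condition on prefix-related labels, nesting twigs amounts to composing such conditions, which is in line with the slide's remark that a twig can be rewritten as a pure boolean query.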
  • 39. Application: Relational Faceted Navigation
      • Faceted Navigation
        – Data-driven exploratory interface
        – User incrementally adds constraints
        – Restricted to one record collection
      • Relational Faceted Navigation
        – Enables navigation of interrelated record collections
        – Constraints affect all record collections
        – New navigation operation: Pivot
          • Switch user view to a record collection
  • 40. Relational Faceted Navigation – Demo HCLS Demo: http://hcls.sindice.com/pivot-browser/
  • 41. Data Model
      • Each collection has its own data model (document)
      • Lucene fields for facets
      • JSON field for relationships with records from other collections
      [Figure: Company (Country, Category, JSON); Investment (Year, Amount, JSON); Investor (Type, JSON)]
  • 42. JSON Model
      • JSON field: Tree covering all the relationships with records from other collections
      • Resulting tree can be very large
      [Figure: one tree per collection (Company, Investment, Investor), built from the nodes category_code, country_code, funding_rounds, round_code, raised_amount, investments, type, funding_rounds -1 and investments -1]
  • 43. JSON Model
      • JSON field: Tree covering all the relationships with records from other collections
      • Resulting tree can be very large
      Company tree:
        category_code
        country_code
        funding_rounds […]
          round_code
          raised_amount
          investments […]
            type
  • 45. JSON Model • JSON field: Tree covering all the relationships with records from other collections • Resulting tree can be very large [Figure: tree rooted at Investment: round_code, raised_amount, funding_rounds⁻¹ […] → Company (category_code, country_code), investments […] → Investor (type)]
  • 47. JSON Model • JSON field: Tree covering all the relationships with records from other collections • Resulting tree can be very large [Figure: tree rooted at Investor: type, investments⁻¹ […] → Investment (round_code, raised_amount, funding_rounds⁻¹ […] → Company (category_code, country_code))]
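As a rough illustration of the Investment-rooted tree on the JSON Model slides, the sketch below derives one Investment document per funding round from a nested Company record. The function name `investment_documents`, the `json` field name, the key `funding_rounds-1` (standing in for the inverse relationship, shown with a `-1` superscript on the slides) and all sample values are assumptions, not part of SIREn.

```python
from typing import Any, Dict, List


def investment_documents(company: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one Investment document per funding round of a Company record."""
    # Attributes of the company itself, without its outgoing relationship.
    company_attrs = {k: v for k, v in company.items() if k != "funding_rounds"}
    docs = []
    for rnd in company.get("funding_rounds", []):
        # Plain attributes of the round become Lucene facet fields.
        facets = {k: v for k, v in rnd.items() if k != "investments"}
        doc = {
            "collection": "Investment",
            **facets,
            # JSON field: tree of relationships seen from the Investment record.
            "json": {
                **facets,
                "funding_rounds-1": [company_attrs],          # back to the Company
                "investments": rnd.get("investments", []),  # on to the Investors
            },
        }
        docs.append(doc)
    return docs
```

The same company attributes are copied into every round's document, which is the duplication that the denormalised model accepts in exchange for avoiding joins.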
  • 49. Navigation Model: Drill-Down
  • 50. Navigation Model: Drill-Down – Lucene query: collection : Company AND country_code : irl AND category_code : software
  • 52. Navigation Model: Pivot – Lucene query: collection : Investment
  • 53. Navigation Model: Pivot – Query Rewriting – Preceding Lucene query: collection : Company AND country_code : irl AND category_code : software – Lucene query: collection : Investment – JSON query: funding_rounds⁻¹ : { country_code : irl, category_code : software }
  • 54. Navigation Model: Pivot – Lucene query: collection : Investment – JSON query: funding_rounds⁻¹ : { country_code : irl, category_code : software }
  • 56. Navigation Model: Pivot – Lucene query: collection : Investor – JSON query: investments⁻¹ : { founded_year : 2012, funding_rounds⁻¹ : { country_code : irl, category_code : software } }
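The drill-down and pivot steps on the preceding slides can be sketched as a small state machine. This is only an illustration of the rewriting pattern shown there, under the assumption that a pivot wraps the previous view's constraints under the inverse relationship. The `View` class, the function names and the key spelling `funding_rounds-1` (for the `-1` superscript on the slides) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class View:
    collection: str
    facets: Dict[str, Any] = field(default_factory=dict)      # Lucene facet constraints
    json_query: Dict[str, Any] = field(default_factory=dict)  # constraints inherited via pivots

    def lucene_query(self) -> str:
        clauses = [f"collection : {self.collection}"]
        clauses += [f"{name} : {value}" for name, value in self.facets.items()]
        return " AND ".join(clauses)


def drill_down(view: View, field_name: str, value: Any) -> View:
    """Add one facet constraint to the current collection."""
    return View(view.collection, {**view.facets, field_name: value}, view.json_query)


def pivot(view: View, target: str, inverse_relation: str) -> View:
    """Switch to `target`; the old constraints move under the inverse relationship."""
    nested = {**view.facets, **view.json_query}
    return View(target, {}, {inverse_relation: nested})


company = drill_down(drill_down(View("Company"), "country_code", "irl"),
                     "category_code", "software")
investment = pivot(company, "Investment", "funding_rounds-1")
investor = pivot(drill_down(investment, "founded_year", 2012),
                 "Investor", "investments-1")
print(company.lucene_query())
print(investor.lucene_query(), investor.json_query)
```

Each pivot nests the whole constraint history one level deeper, so the JSON query grows with the navigation path while the Lucene query stays limited to the current collection.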
  • 57. Comparison with BlockJoin • Lucene BlockJoin – Introduced support for indexing and searching nested data … – … for small and well-defined schema
  • 58. Lucene BlockJoin - Scalability • Artificially increases the number of documents in the index – One document per nested data record • Cache size grows linearly with the number of nested data records – Increased memory usage
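To make the "one document per nested data record" point concrete, the sketch below flattens a nested record into the kind of document block that block-join indexing expects, with children first and the parent last. It does not call Lucene; the function name `to_block`, the `_type` marker field and the sample record are assumptions for illustration.

```python
from typing import Any, Dict, List


def to_block(record: Dict[str, Any], doc_type: str = "parent") -> List[Dict[str, Any]]:
    """Flatten a nested record into a block of flat documents (children before parent)."""
    docs: List[Dict[str, Any]] = []
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # Every nested record in the array becomes its own document.
            for child in value:
                docs.extend(to_block(child, doc_type=key))
        elif isinstance(value, dict):
            docs.extend(to_block(value, doc_type=key))
        else:
            flat[key] = value
    flat["_type"] = doc_type  # tag used to tell parent and child documents apart
    docs.append(flat)         # the enclosing record closes its block
    return docs


company = {
    "name": "LucidWorks",
    "funding_rounds": [
        {"round_code": "a", "investments": [{"name": "Granite Ventures"}]},
        {"round_code": "b", "investments": [{"name": "Granite Ventures"}]},
    ],
}
block = to_block(company)
print(len(block))  # 5 documents for a single company record
```

One company with two funding rounds and one investor per round already yields five index documents, so the document count and any per-document cache grow with the number of nested records rather than the number of top-level records.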
  • 59. Lucene BlockJoin - Flexibility • Developers must be aware of the relations between nested data records – At indexing time, to tag parent records – At query time, to filter parent records • Upfront effort required to design and configure the system – Define parent-child relationships between record collections – Define attributes for each record collection • If not properly designed, risk of incorrect matches
  • 60. Comparison with BlockJoin • BlockJoin + Works out of the box with all of Lucene’s features ‒ Requires upfront design effort ‒ Memory usage depends on the nested data structure • Tree-Labelling + Can handle arbitrary and large nested models + Memory friendly ‒ Lucene’s features have to be re-thought and re-implemented
  • 61. Conclusion • Nested data models are becoming increasingly prevalent • Searching nested data brings new challenges: performance, scalability, flexibility • Different approaches exist, each with pros and cons • SIREn plugin is based on tree-labelling techniques • Enables new kinds of search applications, e.g., a relational faceted browser, with sub-second response times • SIREn Availability – Trial license currently available – In negotiation with the University to open-source
  • 62. Acknowledgement This material is based upon works supported by the European FP7 project LOD2 (257943) and the Irish Research Council for Science, Engineering and Technology.