conlin_joshua@bah.com
bende_bryan@bah.com
owen_james@bah.com

REAL-TIME INVERTED SEARCH IN THE
CLOUD USING LUCENE AND STORM
Joshua Conlin, Bryan Bende, James Owen
Table of Contents
  Problem Statement

  Storm

  Methodology

  Results
Who are we?
Booz Allen Hamilton
–  Large consulting firm supporting many industries
•  Healthcare, Finance, Energy, Defense
–  Strategic Innovation Group
•  Focus on innovative solutions that can be applied across industries
•  Major focus on data science, big data, & information retrieval
•  Multiple clients utilizing Solr for implementing search capabilities
Client Applications & Architecture
Typical client applications allow users to:
•  Query document index using Lucene syntax
•  Filter and facet results
•  Save queries for future use
[Diagram: Ingest, SolrCloud, Web App]
Problem Statement
How do we instantly notify users of new documents that match their
saved queries?
Constraints:
•  Process documents in real-time, notify as soon as possible
•  Scale with the number of saved queries (starting with tens of thousands)
•  Result set of notifications must match saved queries
•  Must not impact performance of the web application
•  Data arrives at varying speeds and varying sizes
Possible Solutions
•  Second Solr instance to handle background execution of saved queries
•  Fork ingest to primary and secondary Solr instances, execute all the saved queries against secondary instance

lotsOfQueries.size() = 1 x 10^9 // Milliard?
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}
// … This will take forever

Pros
•  Easy to set up, simple
•  Works for a consistent, small data flow

Cons
•  Query bound
Possible Solutions
•  Distribute queries amongst multiple machines
•  Execute queries against a shared Solr (or SolrCloud) instance

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *C* OR *D* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *E* OR *F* OR …
}

lotsOfQueries.size() = 2.5 x 10^8
for (Query q : lotsOfQueries) {
    q // *G* OR *H* OR …
}

Who is maintaining this code???

Pros
•  Scalable, only bound by the processing of the Solr instance

Cons
•  Synchronization issues, index cannot be updated during query execution
Possible Solutions
One way to deal with the synchronization issues is to do away with a shared Solr
instance, giving each VM its own instance, then distribute the data or queries evenly
across the VMs.
lotsOfQueries.size() = 5 x 10^8
for (Query q : lotsOfQueries) {
    q // *A* OR *B* OR …
}

lotsOfQueries.size() = 5 x 10^8
for (Query q : lotsOfQueries) {
    q // *C* OR *D* OR …
}

Pros
•  Scalable, processing power only bound by number of VMs
•  Can handle variable data flow, query processing would not need to be synchronized

Cons
•  Difficult to maintain
Possible Solutions

Is there a way we can set up this system so that it’s:
•  easy to maintain,
•  easy to scale, and
•  easy to synchronize?
Candidate Solution
•  Integrate Solr and/or Lucene with a stream processing framework
•  Process data in real-time, leverage proven framework for distributed stream processing
[Diagram: Ingest, SolrCloud, Storm, Web App, Notifications]
Storm - Overview
•  Storm is an open source stream processing framework.
•  It’s a scalable platform that lets you distribute processes across a cluster quickly and easily.
•  You can add more resources to your cluster and easily utilize those resources in your processing.
Storm - Components
•  Nimbus – the control node for the cluster, distributes jobs through the cluster
•  Supervisor – one on each machine in the cluster, controls the allocation of worker assignments on its machine
•  Worker – JVM process for running topology components
[Diagram: one Nimbus, three Supervisors, twelve Workers]
Storm – Core Concepts
•  Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration
•  Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process
•  Storm has 2 types of processing units:
–  Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants: from a database, from a message queue, etc.
–  Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it
Storm – Core Concepts (continued)
•  Stream Groupings – define how topology processing units (spouts and bolts) are connected to each other; some common groupings are:
–  All Grouping – stream is sent to all bolts
–  Shuffle Grouping – stream is evenly distributed across bolts
–  Fields Grouping – sends tuples that match on the designated “field” to the same bolt
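The three groupings above can be pictured as three ways of choosing a target bolt task for a tuple. The following is only a toy model in plain Java, not the Storm API: the class name `GroupingSketch`, the integer task indices, and the hash-based routing are assumptions made for illustration, and Storm's own shuffle grouping balances load more carefully than a plain random pick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy model of three stream groupings (hypothetical names; not the Storm API). */
final class GroupingSketch {

    /** All grouping: every bolt task receives the tuple. */
    static List<Integer> allGrouping(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    /** Shuffle grouping (simplified): one task picked at random, so load evens out over many tuples. */
    static int shuffleGrouping(int numTasks, Random random) {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: equal field values always map to the same task. */
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        System.out.println("all     -> " + allGrouping(4));
        System.out.println("shuffle -> " + shuffleGrouping(4, new Random()));
        System.out.println("fields  -> " + fieldsGrouping("alice", 4));
    }
}
```

In this picture, a spout that must reach every query-executing bolt would use the first rule, while work that any single bolt can handle would use the second.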
How to Utilize Storm
How can we use this framework to solve our problem?

Let Storm distribute out the data and queries between processing nodes

…but we would still need to manage a Solr instance on each VM, and we would even need to ensure synchronization between query processing bolts running on the same VM.
How to Utilize Storm
What if instead of having a Solr installation on each machine we ran Solr in memory inside each of the processing bolts?
•  Use Storm spout to distribute new documents
•  Use Storm bolt to execute queries against EmbeddedSolrServer with RAMDirectory
–  Incoming documents added to index
–  Queries executed
–  Documents removed from index
•  Use Storm bolt to process query results
[Diagram: Bolt containing EmbeddedSolrServer and RAMDirectory]
Advantages
This has several advantages:
•  It removes the need to maintain a Solr instance on each VM.
•  It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts get sent to, all the processing is self-contained.
•  It removes the need to synchronize processing between bolts.
•  Documents are volatile, existing queries over new data
Execution Topology
•  Data Spout – Receives incoming data files and sends to every Executor Bolt
•  Query Spout – Coordinates updates to queries
•  Executor Bolt – Loads and executes queries
•  Notification Bolt – Generates notifications based on results
[Diagram: three Data Spouts, one Query Spout, five Executor Bolts, one Notification Bolt; connections labeled All Grouping and Shuffle Grouping]
Executor Bolt
1.  Queries are loaded into memory
2.  Incoming documents are added to the Lucene index
3.  Documents are processed when one of the following conditions is met:
a)  The number of documents has exceeded the max batch size
b)  The time since the last execution is longer than the max interval time
4.  Matching queries and document UIDs are emitted
5.  Remove all documents from index
[Diagram: Documents, Query List, and emit(), annotated with steps 1-4]
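The five steps above can be sketched as a small batching loop. This is a minimal illustration in plain Java and not the presenters' bolt: `BatchExecutorSketch` and its members are hypothetical names, single-term matching stands in for real Lucene queries against an in-memory index, and the thresholds are only checked when a document arrives, whereas a real bolt would also need a timer to flush an idle batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy batch executor (hypothetical): buffers documents, then runs every saved query over the batch. */
final class BatchExecutorSketch {
    private final List<String> queries;                       // step 1: queries held in memory
    private final List<String[]> batch = new ArrayList<>();   // buffered {uid, text} pairs
    private final int maxBatchSize;
    private final long maxIntervalMillis;
    private long lastExecutionMillis;

    BatchExecutorSketch(List<String> queries, int maxBatchSize, long maxIntervalMillis, long nowMillis) {
        this.queries = new ArrayList<>(queries);
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMillis = maxIntervalMillis;
        this.lastExecutionMillis = nowMillis;
    }

    /** Step 2: buffer the document. Returns "query -> uid" matches when a threshold triggers execution. */
    List<String> addDocument(String uid, String text, long nowMillis) {
        batch.add(new String[] {uid, text});

        // Step 3: execute only when the batch is full or the interval has elapsed.
        boolean batchFull = batch.size() >= maxBatchSize;
        boolean intervalElapsed = nowMillis - lastExecutionMillis >= maxIntervalMillis;
        if (!batchFull && !intervalElapsed) {
            return new ArrayList<>();
        }

        // Step 4: collect every (query, document UID) pair that matches.
        List<String> matches = new ArrayList<>();
        for (String[] doc : batch) {
            Set<String> terms = new HashSet<>(Arrays.asList(doc[1].toLowerCase().split("\\s+")));
            for (String query : queries) {
                if (terms.contains(query.toLowerCase())) {
                    matches.add(query + " -> " + doc[0]);
                }
            }
        }

        // Step 5: drop the processed documents so the next batch starts empty.
        batch.clear();
        lastExecutionMillis = nowMillis;
        return matches;
    }
}
```

Because the buffered documents are discarded after each run, memory use depends on the batch size and the number of queries rather than on the total volume of data seen.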
  
Solr In-Memory Processing Bolt Issues
•  Attempted to run Solr with in-memory index inside Storm bolt
•  Solr 4.5 requires:
–  http-client 4.2.3
–  http-core 4.2.2
•  Storm 0.8.2 & 0.9.0 require:
–  http-client 4.1.1
–  http-core 4.1
•  Could exclude libraries from super jar and rely on storm/lib, but Solr expecting SystemDefaultHttpClient from 4.2.3
•  Could build Storm with newer version of libraries, but not guaranteed to work
Lucene In-Memory Processing Bolt
1.  Initialization
–  Parse Common Solr Schema
–  Replace Solr Classes
2.  Add Documents
–  Convert SolrInputDocument to Lucene Document
–  Add to index
Advantages:
•  Fast, Lightweight
•  No Dependency Conflicts
•  RAMDirectory backed
•  Easy Solr to Lucene Document Conversion
•  Solr Schema based
[Diagram: Bolt containing Lucene Index and RAMDirectory]
Lucene In-Memory Processing Bolt
Parse: Read/Parse/Update Solr Schema File using StAX
Create IndexSchema from new Solr Schema data

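// Converts an incoming Solr document to a Lucene document and adds it to the index.
public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        // Commit so that newly opened readers can see the document.
        indexWriter.commit();
    }
}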
Prototype Solution
•  Infrastructure:
–  8 node cluster on Amazon EC2
–  Each VM has 2 cores and 8 GB of memory
•  Data:
–  92,000 news article summaries
–  Average file size: ~1 KB
•  Queries:
–  Generated 1 million sample queries
–  Randomly selected terms from document set
–  Stored in MariaDB (username, query string)
–  Query Executor Bolt configured to use any subset of these queries
Prototype Solution – Monitoring Performance
•  Metrics Provided by Storm UI
–  Emitted: number of tuples emitted
–  Transferred: number of tuples transferred (emitted * # follow-on bolts)
–  Acked: number of tuples acknowledged
–  Execute Latency: timestamp when execute function ends - timestamp when execute is passed tuple
–  Process Latency: timestamp when ack is called - timestamp when execute is passed tuple
–  Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
•  Many metrics are samples, don’t always indicate problems
•  Good measurement is comparing number of tuples transferred from spout, to number of tuples acknowledged in bolt
–  If transferred number is getting increasingly higher than number of acknowledged tuples, then the topology is not keeping up with the rate of data
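The transferred-versus-acknowledged comparison can be turned into a simple check. The sketch below is an assumption-laden illustration: `BacklogMonitorSketch` is a hypothetical helper, the counter readings are assumed to be collected by hand or by some script, and "falling behind" is defined here as the gap growing in several consecutive readings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tracks the gap between tuples transferred by a spout and tuples acked by a bolt. */
final class BacklogMonitorSketch {
    private final List<Long> gaps = new ArrayList<>();

    /** Records one reading of the two counters. */
    void record(long transferred, long acked) {
        gaps.add(transferred - acked);
    }

    /** Returns true when the gap grew in each of the last `window` consecutive readings. */
    boolean isFallingBehind(int window) {
        if (gaps.size() <= window) {
            return false; // not enough readings to judge a trend
        }
        for (int i = gaps.size() - window; i < gaps.size(); i++) {
            if (gaps.get(i) <= gaps.get(i - 1)) {
                return false; // the gap shrank or held steady at least once
            }
        }
        return true;
    }
}
```

A single large gap is not flagged on its own, which fits the caution above that sampled metrics do not always indicate a problem.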
Trial Runs – First Attempt
•  8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
•  Article spout emitting as fast as possible
•  Query execution at 1k docs or 60 seconds elapsed time
•  Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
[Diagram: Nodes 1-8, each with 4 worker slots; Article Spout on Node 1; one Query Bolt and one Result Bolt per node]
Results:
•  Articles emitted too fast for bolts to keep up
•  If data continued to stream at this rate, topology would back up and drop tuples
Trial Runs – Second Attempt
•  8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
•  Article spout now places articles on queue in background thread every 100ms
•  Everything else the same…
[Diagram: Nodes 1-8, each with 4 worker slots; Article Spout on Node 1; one Query Bolt and one Result Bolt per node]
Results:
•  Topology performing much better, keeping up with data flow for query size of 10k, 50k, 100k, 200k
•  Slows down around 300k queries, approx 37.5k queries/bolt
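The throttling change in this attempt can be sketched as a queue fed by a background thread. This is a plain-Java illustration rather than the presenters' spout: `ThrottledSourceSketch` is a hypothetical class, it assumes one article is released per period (the slide does not say how many), and `poll()` stands in for what a spout's `nextTuple()` would read from.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy throttled source (hypothetical): a background thread releases one article per period. */
final class ThrottledSourceSketch {
    private final Iterator<String> articles;
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private ScheduledExecutorService scheduler;

    ThrottledSourceSketch(List<String> articles) {
        this.articles = articles.iterator();
    }

    /** Moves at most one article onto the queue; the scheduler calls this once per period. */
    synchronized void tick() {
        if (articles.hasNext()) {
            queue.offer(articles.next());
        }
    }

    /** Starts the background thread, e.g. start(100) for one article every 100 ms. */
    void start(long periodMillis) {
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::tick, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the background thread if it was started. */
    void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }

    /** Returns the next ready article, or null when nothing is ready yet. */
    String poll() {
        return queue.poll();
    }
}
```

Decoupling production from emission this way caps the input rate at one article per period regardless of how quickly the consumer asks for more.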
Trial Runs – Third Attempt
•  Each node has 4 worker slots so let’s scale up
•  16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
•  Everything else the same…
[Diagram: Nodes 1-8, each with 4 worker slots; Article Spout on Node 1; two Query Bolts and one Result Bolt per node]
Results:
•  300k queries now keeping up no problem
•  400k doing ok…
•  500k backing up a bit
Trial Runs – Fourth Attempt
•  Next logical step, 32 workers, 1 spout, 32 Query Executor Bolts
•  Didn’t result in anticipated performance gain, 500k still too much
•  Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
[Diagram: Nodes 1-8, each with 4 worker slots; Article Spout on Node 1; four Query Bolts and one Result Bolt per node]
Trial Runs – Conclusions
•  Most important factor affecting performance is relationship between data rate and number of queries
•  Ideal Storm configuration is dependent on hardware executing the topology
•  Optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across topology
•  High level of performance from relatively small cluster
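The throughput figures can be cross-checked with simple arithmetic. The sketch below assumes the optimal configuration was the 16-bolt run, which the slides do not state outright but which is consistent with 250 queries per second per bolt giving 4k across the topology; `ThroughputSketch` is a hypothetical helper.

```java
/** Hypothetical helper for checking the per-bolt and topology-wide figures. */
final class ThroughputSketch {

    /** Queries each bolt holds when the saved queries are split evenly. */
    static double queriesPerBolt(long totalQueries, int bolts) {
        return (double) totalQueries / bolts;
    }

    /** Topology-wide rate when every bolt sustains the same per-bolt rate. */
    static long topologyRate(long perBoltRatePerSecond, int bolts) {
        return perBoltRatePerSecond * bolts;
    }

    public static void main(String[] args) {
        // 300k queries over 8 bolts, as in the second attempt.
        System.out.println(queriesPerBolt(300_000, 8));
        // 250 queries/s per bolt over an assumed 16 bolts.
        System.out.println(topologyRate(250, 16));
    }
}
```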
Conclusions
•  Low barrier to entry working with Storm
•  Easy conversion of Solr indices to Lucene indices
•  Simple integration between Lucene and Storm; Solr more complicated
•  Configuration is key, tune topology to your needs
•  Overall strategy appears to scale well for our use case, limited only by hardware
Future Considerations
•  Adjust the batch size on the query executor bolt
•  Combine duplicate queries (between users) if your system has many duplicates
•  Investigate additional optimizations during Solr to Lucene conversion
•  Run topology with more complex queries (fielded, filtered, etc.)
•  Investigate handling of bolt failure
•  If ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts
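Combining duplicate queries could look like the following sketch, which groups usernames under each distinct query string so a query runs once and its matches fan out to every owner. It assumes the (username, query string) rows have already been read from storage; `QueryDedupSketch` is a hypothetical name, and only whitespace is normalized because case can be significant in Lucene query syntax.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: groups usernames by query string so each distinct query runs once. */
final class QueryDedupSketch {

    /** Each row is {username, queryString}; the result maps a normalized query to its owners. */
    static Map<String, Set<String>> usersByQuery(List<String[]> savedQueries) {
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String[] row : savedQueries) {
            String username = row[0];
            // Collapse whitespace only; operators such as AND/OR keep their case.
            String normalized = row[1].trim().replaceAll("\\s+", " ");
            grouped.computeIfAbsent(normalized, key -> new LinkedHashSet<>()).add(username);
        }
        return grouped;
    }
}
```

With this shape, the executor would iterate over the map's keys and notify every user in the matching value set.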
Questions?

Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022Whats new in Autonomous Database in 2022
Whats new in Autonomous Database in 2022
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platformlucenerevolution
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucenelucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Real-time Inverted Search in the Cloud Using Lucene and Storm

  • 1.
  • 2. conlin_joshua@bah.com bende_bryan@bah.com owen_james@bah.com REAL-TIME INVERTED SEARCH IN THE CLOUD USING LUCENE AND STORM Joshua Conlin, Bryan Bende, James Owen
  • 3. Table of Contents   Problem Statement   Storm   Methodology   Results
  • 4. Who are we ? Booz Allen Hamilton –  Large consulting firm supporting many industries •  Healthcare, Finance, Energy, Defense –  Strategic Innovation Group •  Focus on innovative solutions that can be applied across industries •  Major focus on data science, big data, & information retrieval •  Multiple clients utilizing Solr for implementing search capabilities
  • 5. Client Applications & Architecture
      Typical client applications allow users to:
      •  Query document index using Lucene syntax
      •  Filter and facet results
      •  Save queries for future use
      [Diagram labels: Ingest, SolrCloud, Web App]
  • 6. Problem Statement How do we instantly notify users of new documents that match their saved queries? Constraints: •  Process documents in real-time, notify as soon as possible •  Scale with the number of saved queries (starting with tens of thousands) •  Result set of notifications must match saved queries •  Must not impact performance of the web application •  Data arrives at varying speeds and varying sizes
  • 7. Possible Solutions
      •  Second Solr instance to handle background execution of saved queries
      •  Fork ingest to primary and secondary Solr instances, execute all the saved queries against secondary instance
      lotsOfQueries.size() = 1 × 10^9 //Milliard?
      for (Query q : lotsOfQueries) {
          q //*A* OR *B* OR …
      }
      //… This will take forever
      Pros
      •  Easy to set up, simple
      •  Works for a consistent, small data flow
      Cons
      •  Query bound
  • 8. Possible Solutions
      •  Distribute queries amongst multiple machines
      •  Execute queries against a shared Solr (or SolrCloud) instance
      lotsOfQueries.size() = 2.5 × 10^8; for (Query q : lotsOfQueries) { q //*A* OR *B* OR … }
      lotsOfQueries.size() = 2.5 × 10^8; for (Query q : lotsOfQueries) { q //*C* OR *D* OR … }
      lotsOfQueries.size() = 2.5 × 10^8; for (Query q : lotsOfQueries) { q //*E* OR *F* OR … }
      lotsOfQueries.size() = 2.5 × 10^8; for (Query q : lotsOfQueries) { q //*G* OR *H* OR … }
      Pros
      •  Scalable, only bound by the processing of the Solr instance
      Cons
      •  Who is maintaining this code???
      •  Synchronization issues, index cannot be updated during query execution
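The idea of distributing queries amongst multiple machines can be sketched as a simple round-robin split. This is only an illustration: the class and method names are hypothetical, and the slides do not say how the queries were actually divided.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the slides: splits saved queries across machines.
class QueryPartitionSketch {
    // Round-robin assignment, so slice sizes differ by at most one query.
    static <T> List<List<T>> partition(List<T> queries, int machines) {
        if (machines <= 0) {
            throw new IllegalArgumentException("machines must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        for (int i = 0; i < machines; i++) {
            slices.add(new ArrayList<>());
        }
        for (int i = 0; i < queries.size(); i++) {
            slices.get(i % machines).add(queries.get(i));
        }
        return slices;
    }
}
```

With four machines and 1 × 10^9 queries, each slice would hold 2.5 × 10^8 queries, matching the per-machine figure on the slide.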
  • 9. Possible Solutions
      One way to deal with the synchronization issues is to do away with a shared Solr instance, giving each VM its own instance, then distribute the data or queries evenly across the VMs.
      lotsOfQueries.size() = 5 × 10^8; for (Query q : lotsOfQueries) { q //*A* OR *B* OR … }
      lotsOfQueries.size() = 5 × 10^8; for (Query q : lotsOfQueries) { q //*C* OR *D* OR … }
      Pros
      •  Scalable, processing power only bound by number of VMs
      •  Can handle variable data flow, query processing would not need to be synchronized
      Cons
      •  Difficult to maintain
  • 10. Possible Solutions Is there a way we can set up this system so that it’s: •  easy to maintain, •  easy to scale, and •  easy to synchronize?
  • 11. Candidate Solution
      •  Integrate Solr and/or Lucene with a stream processing framework
      •  Process data in real-time, leverage proven framework for distributed stream processing
      [Diagram labels: Ingest, SolrCloud, Storm, Web App, Notifications]
  • 12. Storm - Overview •  Storm is an open source stream processing framework. •  It’s a scalable platform that lets you distribute processes across a cluster quickly and easily. •  You can add more resources to your cluster and easily utilize those resources in your processing.
  • 13. Storm - Components
      •  Nimbus – the control node for the cluster, distributes jobs through the cluster
      •  Supervisor – one on each machine in the cluster, controls the allocation of worker assignments on its machine
      •  Worker – JVM process for running topology components
      [Diagram: one Nimbus, three Supervisors, twelve Workers]
  • 14. Storm – Core Concepts •  Topology – defines a running process, which includes all of the processes to be run, the connections between those processes, and their configuration •  Stream – the flow of data through a topology; it is an unbounded collection of tuples that is passed from process to process •  Storm has 2 types of processing units: –  Spout – the start of a stream; it can be thought of as the source of the data; that data can be read in however the spout wants—from a database, from a message queue, etc. –  Bolt – the primary processing unit for a topology; it accepts any number of streams, does whatever processing you’ve set it to do, and outputs any number of streams based on how you configure it
  • 15. Storm – Core Concepts (continued) •  Stream Groupings – defines how topology processing units (spouts and bolts) are connected to each other; some common groupings are: –  All Grouping – stream is sent to all bolts –  Shuffle Grouping – stream is evenly distributed across bolts –  Fields grouping – sends tuples that match on the designated “field” to the same bolt
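The three groupings above can be illustrated with a small routing sketch. This is a stand-in written in plain Java, not the Storm API: the class and method names are hypothetical, and shuffle grouping is approximated with round-robin purely to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for Storm's stream groupings; not the Storm API.
class GroupingSketch {
    // All grouping: every bolt task receives a copy of the tuple.
    static List<Integer> allGrouping(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // Shuffle grouping: tuples are spread evenly across tasks.
    // Approximated here with round-robin on a tuple sequence number (an assumption).
    static int shuffleGrouping(long tupleSequence, int numTasks) {
        return (int) (tupleSequence % numTasks);
    }

    // Fields grouping: tuples with the same field value always reach the same task.
    static int fieldsGrouping(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

In this sketch, an all grouping fits the case where every bolt must see every document, while a fields grouping would suit routing by a key such as a username.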
  • 16. How to Utilize Storm
      How can we use this framework to solve our problem?
      Let Storm distribute out the data and queries between processing nodes…
      …but we would still need to manage a Solr instance on each VM, and we would even need to ensure synchronization between query processing bolts running on the same VM.
  • 17. How to Utilize Storm What if instead of having a Solr installation on each machine we ran Solr in memory inside each of the processing bolts? •  Use Storm spout to distribute new documents •  Use Storm bolt to execute queries against EmbeddedSolrServer with RAMDirectory –  Incoming documents added to index –  Queries executed –  Documents removed from index •  Use Storm bolt to process query results Bolt   EmbeddedSolrServer   RAMDirectory  
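The add, query, remove cycle described above can be sketched without Solr or Lucene. The example below swaps in a plain in-memory term map for the EmbeddedSolrServer and RAMDirectory, and treats each saved query as a list of OR-ed terms; all names are hypothetical and the matching is far simpler than real Lucene query execution.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Stand-in for an in-memory index inside a bolt; not Solr or Lucene.
class InMemoryMatcherSketch {
    // term -> ids of documents in the current batch that contain the term
    private final Map<String, Set<String>> termIndex = new HashMap<>();

    // Step 1: incoming documents are added to the index.
    void addDocument(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                termIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Step 2: every saved query (query id -> OR-ed terms) runs against the batch.
    // Only queries with at least one matching document appear in the result.
    Map<String, Set<String>> executeQueries(Map<String, List<String>> savedQueries) {
        Map<String, Set<String>> matches = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> query : savedQueries.entrySet()) {
            Set<String> hits = new TreeSet<>();
            for (String term : query.getValue()) {
                hits.addAll(termIndex.getOrDefault(
                        term.toLowerCase(Locale.ROOT), Collections.<String>emptySet()));
            }
            if (!hits.isEmpty()) {
                matches.put(query.getKey(), hits);
            }
        }
        return matches;
    }

    // Step 3: documents are removed so the next batch starts from an empty index.
    void clear() {
        termIndex.clear();
    }
}
```

Because the index only ever holds the current batch, the saved queries run over new data only, which is the inversion the talk title refers to.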
  • 18. Advantages This has several advantages: •  It removes the need to maintain a Solr instance on each VM. •  It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts get sent to, all the processing is self-contained. •  It removes the need to synchronize processing between bolts. •  Documents are volatile, existing queries over new data
  • 19. Execution Topology
      •  Data Spout – Receives incoming data files and sends to every Executor Bolt
      •  Query Spout – Coordinates updates to queries
      •  Executor Bolt – Loads and executes queries
      •  Notification Bolt – Generates notifications based on results
      [Diagram: Data Spout ×3, Query Spout, Executor Bolt ×5, Notification Bolt; connections labeled All Grouping and Shuffle Grouping]
  • 20. Executor Bolt
      1.  Queries are loaded into memory
      2.  Incoming documents are added to the Lucene index
      3.  Documents are processed when one of the following conditions is met:
          a)  The number of documents has exceeded the max batch size
          b)  The time since the last execution is longer than the max interval time
      4.  Matching queries and document UIDs are emitted
      5.  All documents are removed from the index
      [Diagram labels: Documents, Query List, emit(), with steps 1–4 marked]
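The batch trigger in step 3 can be sketched as a small state holder. The names below are hypothetical, the clock is passed in explicitly so the behavior is deterministic, and the sketch fires once the batch reaches the maximum size, which is one reading of "exceeded".

```java
// Hypothetical sketch of the Executor Bolt's batch trigger; not the authors' code.
class BatchTriggerSketch {
    private final int maxBatchSize;
    private final long maxIntervalMillis;
    private int pendingDocs = 0;
    private long lastExecutionMillis;

    BatchTriggerSketch(int maxBatchSize, long maxIntervalMillis, long nowMillis) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMillis = maxIntervalMillis;
        this.lastExecutionMillis = nowMillis;
    }

    // Records one incoming document and reports whether the batch should run now.
    boolean onDocument(long nowMillis) {
        pendingDocs++;
        return shouldExecute(nowMillis);
    }

    // True when the batch is full, or when documents are waiting and the
    // time since the last execution is longer than the max interval.
    boolean shouldExecute(long nowMillis) {
        boolean batchFull = pendingDocs >= maxBatchSize;
        boolean intervalElapsed = pendingDocs > 0
                && nowMillis - lastExecutionMillis > maxIntervalMillis;
        return batchFull || intervalElapsed;
    }

    // Called after queries have run and the index has been cleared.
    void markExecuted(long nowMillis) {
        pendingDocs = 0;
        lastExecutionMillis = nowMillis;
    }
}
```

The trial runs later in the deck use 1k documents or 60 seconds, which in this sketch would correspond to `new BatchTriggerSketch(1000, 60_000, startMillis)`.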
  • 21. Solr In-Memory Processing Bolt Issues
      •  Attempted to run Solr with in-memory index inside Storm bolt
      •  Solr 4.5 requires:
          –  http-client 4.2.3
          –  http-core 4.2.2
      •  Storm 0.8.2 & 0.9.0 require:
          –  http-client 4.1.1
          –  http-core 4.1
      •  Could exclude libraries from super jar and rely on storm/lib, but Solr expects SystemDefaultHttpClient from 4.2.3
      •  Could build Storm with newer versions of the libraries, but that is not guaranteed to work
  • 22. Lucene In-Memory Processing Bolt
      1.  Initialization
          –  Parse Common Solr Schema
          –  Replace Solr Classes
      2.  Add Documents
          –  Convert SolrInputDocument to Lucene Document
          –  Add to index
      Advantages:
      •  Fast, Lightweight
      •  No Dependency Conflicts
      •  RAMDirectory backed
      •  Easy Solr to Lucene Document Conversion
      •  Solr Schema based
      [Diagram labels: Bolt, Lucene Index, RAMDirectory]
  • 23. Lucene In-Memory Processing Bolt
      Parse:
      •  Read/Parse/Update Solr Schema File using StAX
      •  Create IndexSchema from new Solr Schema data
      public void addDocument(SolrInputDocument doc) throws Exception {
          if (doc != null) {
              // Convert the Solr document to a Lucene document, then index and commit it
              Document luceneDoc = solrDocumentConverter.convert(doc);
              indexWriter.addDocument(luceneDoc);
              indexWriter.commit();
          }
      }
  • 24. Prototype Solution
      •  Infrastructure:
          –  8 node cluster on Amazon EC2
          –  Each VM has 2 cores and 8 GB of memory
      •  Data:
          –  92,000 news article summaries
          –  Average file size: ~1 KB
      •  Queries:
          –  Generated 1 million sample queries
          –  Randomly selected terms from document set
          –  Stored in MariaDB (username, query string)
          –  Query Executor Bolt configured to load any subset of these queries
  • 25. Prototype Solution – Monitoring Performance
      •  Metrics provided by Storm UI
          –  Emitted: number of tuples emitted
          –  Transferred: number of tuples transferred (emitted * # follow-on bolts)
          –  Acked: number of tuples acknowledged
          –  Execute Latency: timestamp when execute function ends - timestamp when execute is passed tuple
          –  Process Latency: timestamp when ack is called - timestamp when execute is passed tuple
          –  Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
      •  Many metrics are samples and don't always indicate problems
      •  A good measurement is comparing the number of tuples transferred from the spout to the number of tuples acknowledged in the bolt
          –  If the transferred number is getting increasingly higher than the number of acknowledged tuples, then the topology is not keeping up with the rate of data
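The capacity metric and the transferred-versus-acked check can be worked through in a few lines. This is a sketch with hypothetical names: capacity is computed here as executed tuples times average execute latency divided by the window length, which is one way to express "fraction of the window spent executing", and the input numbers are invented for illustration.

```java
// Hypothetical helpers for reasoning about Storm UI metrics; not part of Storm.
class TopologyMetricsSketch {
    // Fraction of the window the bolt spent executing tuples.
    static double capacity(long executedTuples, double executeLatencyMillis, long windowMillis) {
        return (executedTuples * executeLatencyMillis) / windowMillis;
    }

    // Tuples transferred from the spout that the bolt has not yet acknowledged.
    static long backlog(long transferredFromSpout, long ackedInBolt) {
        return transferredFromSpout - ackedInBolt;
    }

    // True when every backlog sample is larger than the one before it,
    // i.e. the topology is falling further behind over time.
    static boolean fallingBehind(long[] backlogSamples) {
        if (backlogSamples.length < 2) {
            return false;
        }
        for (int i = 1; i < backlogSamples.length; i++) {
            if (backlogSamples[i] <= backlogSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, 30,000 tuples at 10 ms each over a 10-minute (600,000 ms) window gives a capacity of 0.5, meaning the bolt was busy about half the time.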
  • 26. Trial Runs – First Attempt
      •  8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
      •  Article spout emitting as fast as possible
      •  Query execution at 1k docs or 60 seconds elapsed time
      •  Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
      [Diagram: eight nodes with 4 worker slots each; Article Spout on Node 1; Query Bolts and Result Bolts spread across the nodes]
      Results:
      •  Articles emitted too fast for bolts to keep up
      •  If data continued to stream at this rate, topology would back up and drop tuples
  • 27. Trial Runs – Second Attempt
      •  8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
      •  Article spout now places articles on queue in background thread every 100ms
      •  Everything else the same…
      [Diagram: eight nodes with 4 worker slots each; Article Spout on Node 1; Query Bolts and Result Bolts spread across the nodes]
      Results:
      •  Topology performing much better, keeping up with data flow for query size of 10k, 50k, 100k, 200k
      •  Slows down around 300k queries, approx 37.5k queries/bolt
  • 28. Trial Runs – Third Attempt
      •  Each node has 4 worker slots, so let's scale up
      •  16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
      •  Everything else the same…
      [Diagram: Article Spout on Node 1; Nodes 1 to 8 each with 4 worker slots, two Query Bolts and one Result Bolt]
      Results:
      •  300k queries now keeping up no problem
      •  400k doing ok…
      •  500k backing up a bit
  • 29. Trial Runs – Fourth Attempt
      •  Next logical step, 32 workers, 1 spout, 32 Query Executor Bolts
      •  Didn’t result in anticipated performance gain, 500k still too much
      •  Hypothesizing that 2-core VMs might not be enough to get full performance from 4 worker slots
      [Diagram: Article Spout on Node 1; Nodes 1 to 8 each with 4 worker slots, four Query Bolts and one Result Bolt]
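The second attempt quotes roughly 37.5k queries per bolt at 300k queries on 8 bolts, which is the total divided evenly across the bolts. The same arithmetic can be applied to the other trials. The sketch below assumes an even split, which the slides do not state explicitly, and `QueryLoad` is a hypothetical helper.

```java
/** Hypothetical helper: per-bolt query load, assuming queries are split evenly. */
final class QueryLoad {
    static double queriesPerBolt(int totalQueries, int queryBolts) {
        if (queryBolts <= 0) {
            throw new IllegalArgumentException("queryBolts must be positive");
        }
        return (double) totalQueries / queryBolts;
    }

    public static void main(String[] args) {
        int[] boltCounts = {8, 16, 32};              // second, third, fourth attempt
        int[] queryCounts = {300_000, 400_000, 500_000};
        for (int bolts : boltCounts) {
            for (int queries : queryCounts) {
                System.out.println(bolts + " bolts, " + queries + " queries: "
                        + queriesPerBolt(queries, bolts) + " queries/bolt");
            }
        }
    }
}
```

Under this assumption, 500k queries come to 31,250 per bolt on 16 bolts and 15,625 per bolt on 32 bolts. That the lower per-bolt count did not help in the fourth attempt is consistent with the hypothesis that the 2-core VMs, rather than the number of bolts, had become the limit.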
  • 30. Trial Runs – Conclusions
      •  Most important factor affecting performance is the relationship between data rate and number of queries
      •  Ideal Storm configuration depends on the hardware executing the topology
      •  Optimal configuration resulted in 250 queries per second per bolt, 4k queries per second across the topology
      •  High level of performance from a relatively small cluster
  • 31. Conclusions
      •  Low barrier to entry working with Storm
      •  Easy conversion of Solr indices to Lucene indices
      •  Simple integration between Lucene and Storm; Solr more complicated
      •  Configuration is key, tune topology to your needs
      •  Overall strategy appears to scale well for our use case, limited only by hardware
  • 32. Future Considerations
      •  Adjust the batch size on the query executor bolt
      •  Combine duplicate queries (between users) if your system has many duplicates
      •  Investigate additional optimizations during Solr-to-Lucene conversion
      •  Run topology with more complex queries (fielded, filtered, etc.)
      •  Investigate handling of bolt failure
      •  If the ratio of incoming data to queries were reversed, consider switching the groupings between the spouts and executor bolts
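One way to approach the "combine duplicate queries" item is to group saved queries by their text so each distinct query is executed once and its hits are fanned out to every user who saved it. The sketch below is an assumption-laden illustration: `QueryDeduplicator` and `SavedQuery` are hypothetical names, and duplicates are detected only by trimmed string equality, which would miss queries that are logically equivalent but written differently.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: collapses identical saved queries across users. */
final class QueryDeduplicator {

    /** Hypothetical saved-query record: who saved it and the raw query text. */
    static final class SavedQuery {
        final String userId;
        final String queryText;

        SavedQuery(String userId, String queryText) {
            this.userId = userId;
            this.queryText = queryText;
        }
    }

    /** Maps each distinct query text to the users who should be notified of its hits. */
    static Map<String, Set<String>> groupByQuery(List<SavedQuery> saved) {
        Map<String, Set<String>> usersByQuery = new LinkedHashMap<>();
        for (SavedQuery sq : saved) {
            String key = sq.queryText.trim(); // assumption: exact text match after trimming
            usersByQuery.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(sq.userId);
        }
        return usersByQuery;
    }
}
```

Each query executor bolt would then hold only the distinct query strings, and the result bolt would look up the user set for a matching query before sending notifications.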