Numeric Range Queries
in Lucene and Solr
kirilchukvadim@gmail.com
Agenda:
● What is RangeQuery
● Which field type to use for Numerics
● Range stuff under the hood (run!)
● NumericRangeQuery
● Useful links
Range Queries:
A range query is a type of query that matches
all documents where some value is between an
upper and lower boundary:
Give me:
● Jeans priced from $200 to $300
● Cars with a length from 5 to 10 m
● ...
Range Queries:
In Solr, a range query is as simple as:
q = field:[100 TO 200]
We will talk about numeric range queries,
but you can use range queries for text too:
q = field:[A TO Z]
Which field type?
Which field type to use for “range” fields (let’s
stick with int) in schema?
● solr.IntField
● or maybe solr.SortableIntField
● or maybe solr.TrieIntField
Which field type?
Let’s assume we have:
● 11 documents, id: 1, 2, 3, ..., 11
● each doc has a single-valued “int” price field
● each document’s id is the same as its price
● q = *:*
"numFound": 11,
"docs": [
{
"id": 1, "price_field": 1
},
{
"id": 2, "price_field": 2
},
...
{
"id": 11, "price_field": 11 }]
Which field type - solr.IntField
q = price_field:[1 TO 10]
"numFound": 2,
"start": 0,
"docs": [
{
"price_field": 1
},
{
"price_field": 10
}
]
}
Which field type - solr.IntField
Stores and indexes the text value verbatim and hence
doesn't correctly support range queries, since the
lexicographic ordering isn't the same as the numeric ordering:
[1,10],11,2,3,4,5,6,7,8,9
Interesting, but “sort by” works fine...
A clever comparator knows that the values are ints!
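To make the ordering problem concrete, here is a minimal sketch in plain Java (no Lucene or Solr involved). It assumes the terms are compared purely as strings, which is what a verbatim-indexed field amounts to; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class LexicographicRangeDemo {
    // Returns the terms a string range [lower TO upper] would match when the
    // terms are ordered as text (natural String order is lexicographic).
    static List<String> stringRange(List<String> terms, String lower, String upper) {
        TreeSet<String> sorted = new TreeSet<>(terms);
        return new ArrayList<>(sorted.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        List<String> prices = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            prices.add(Integer.toString(i));
        }
        // Term order as text: [1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9]
        System.out.println(new TreeSet<>(prices));
        // What [1 TO 10] matches as text: [1, 10]
        System.out.println(stringRange(prices, "1", "10"));
    }
}
```

Only "1" and "10" fall between the two bounds, which is the "numFound": 2 result seen above.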
Which field type - solr.SortableIntField
● q = price_field:[1 TO 10]
○ "numFound": 10
● “Sortable” in fact refers to encoding the numbers so
that they sort in the correct order. It’s not actually
about “sort by”!
● Processed and compared as strings!!!
Tricky string encoding:
NumberUtils.int2sortableStr(...)
● Deprecated and will be removed in 5.X
● What should I use then?
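The idea of an order-preserving string encoding can be sketched as follows. This is NOT the format produced by NumberUtils.int2sortableStr; it is a hypothetical stand-in (sign bit flipped, then zero-padded hex) that only illustrates why string comparison can give numerically correct ranges once the encoding is chosen carefully.

```java
class SortableEncodingDemo {
    // Hypothetical encoding (an assumption, not Solr's actual one): flip the
    // sign bit so negatives sort before positives, then pad to 8 hex digits
    // so that every term has the same length.
    static String encode(int value) {
        return String.format("%08x", value ^ Integer.MIN_VALUE);
    }

    // Counts the values whose encoded term lies in [lower TO upper],
    // comparing the terms strictly as strings.
    static int countInRange(int[] values, int lower, int upper) {
        String lo = encode(lower);
        String hi = encode(upper);
        int count = 0;
        for (int v : values) {
            String term = encode(v);
            if (term.compareTo(lo) >= 0 && term.compareTo(hi) <= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] prices = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        System.out.println(countInRange(prices, 1, 10));
    }
}
```

With fixed-width, sign-adjusted terms the string range [1 TO 10] covers ten of the eleven example documents, matching the "numFound": 10 above.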
Which field type - solr.TrieIntField
● q = price_field:[1 TO 10]
○ "numFound": 10

● Recommended as replacement for IntField
and SortableIntField in javadoc
● Default for primitive fields in reference
schema
● Said to be fast for range queries (actually
depends on precision step)
● Tricky and, btw wtf is precision step?
Under the hood - Index
NumericTokenStream is where half of the magic happens!
● precision step = 1
● value = 11
00000000 00000000 00000000 00001011
● Let’s see how it will be indexed!
Under the hood - Index
Field with precisionStep=1, value = 11:
shift=0   00001011   11
shift=1   00001010   10 = 5 << 1
shift=2   00001000   8 = 2 << 2
shift=3   00001000   8 = 1 << 3
shift=4   00000000   0 = 0 << 4
shift=5   00000000   0 = 0 << 5
continue…
Under the hood - Index
How much for an integer?
11111111 11111111 11111111 11111111
The algorithm indexes 32/precisionStep terms per value.
So, for “11” with precisionStep = 1 we have 32 terms:
11, 10, 8, 8, 0, 0, 0, 0, 0, ..., 0
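The sequence of indexed values can be reproduced with a few lines of Java. This is a simplified sketch: it only clears the low bits for each shift, whereas the real NumericTokenStream also encodes the shift into the term itself. The class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTermsDemo {
    // For shift = 0, step, 2*step, ... (below 32 bits), drop the lowest
    // `shift` bits of the value and shift back, i.e. keep only the prefix.
    static List<Integer> prefixTerms(int value, int precisionStep) {
        List<Integer> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add((value >>> shift) << shift);
        }
        return terms;
    }

    public static void main(String[] args) {
        // 32 terms for precisionStep = 1, starting 11, 10, 8, 8, 0, ...
        System.out.println(prefixTerms(11, 1));
        // Only 4 terms for precisionStep = 8
        System.out.println(prefixTerms(11, 8));
    }
}
```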
Under the hood - Index
Okay! We indexed 32 tokens for the field.
(TermDictionary! Postings!) Where is the trick?

Stay tuned!
Under the hood - Query
Subclasses of FieldType can override
#getRangeQuery(...) to provide their own range
query implementation.
If they don't, you will likely get:
MultiTermQuery rangeQuery = TermRangeQuery.newStringRange(...)
TrieField overrides it. And here comes...
Numeric Range Query (Decimal)
● Decimal example, precisionStep = ten
● q = price:[423 TO 642]
Numeric Range Query (Binary)
● precisionStep = 1
● q = price:[3 TO 12]
[Figure: binary trie over the indexed terms, built up level by level: leaf terms 0..13 (shift = 0), parent terms 0..6 (shift = 1), 0..3 (shift = 2), 0..1 (shift = 3)]
Numeric Range Query (How?)

So, the question is:
How do we create a query for this algorithm?
Numeric Range Query (How?)
Let’s come back to TrieField#getRangeQuery(...)
There are several options:
● field is multiValued, has docValues, and is not indexed
○ super#getRangeQuery
● field has docValues and is not indexed
○ new ConstantScoreQuery(
FieldCacheRangeFilter.newIntRange(...))
● otherwise, ta-da
○ NumericRangeQuery.newIntRange(...)
Numeric Range Query (How?)
NumericRangeQuery extends MultiTermQuery
which is:
An abstract Query that matches documents
containing a subset of terms provided by a
FilteredTermsEnum enumeration.
This query cannot be used directly (it is abstract); you
must subclass it and define getTermsEnum(Terms,
AttributeSource) to provide a FilteredTermsEnum
that iterates through the terms to be matched.
Numeric Range Query (How?)
Let’s understand how #getTermsEnum works.
Returns new NumericRangeTermsEnum(...)
The main part is: NumericUtils.splitIntRange(...)
Numeric Range Query (How?)
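The algorithm relies heavily on bit masks:
for (int shift = 0; ; shift += precisionStep) {  // exits once the last range is added
  diff = 1L << (shift + precisionStep);
  mask = ((1L << precisionStep) - 1L) << shift;
  ...
}
[Figure: leaf terms 0, 1, 2, 3 under parent terms 0 and 1, with diff = 2]
diff is the distance between neighbouring nodes on the next (upper) level.
mask checks whether a bound at the current level must be matched
separately: a lower bound of 1 or 3 gives hasLower, an upper bound
of 0 or 2 gives hasUpper.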
Numeric Range Query (How?)
hasLower = (minBound & mask) != 0L;
hasUpper = (maxBound & mask) != mask;
if (hasLower)
  addRange(builder, valSize, minBound, minBound | mask, shift);
if (hasUpper)
  addRange(builder, valSize, maxBound & ~mask, maxBound, shift);
Numeric Range Query (How?)
nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
Numeric Range Query (How?)
// If we are in the lowest precision or the next precision is not available:
addRange(builder, valSize, minBound, maxBound, shift);
// exit the split recursion loop (FOR)
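Putting the fragments above together, the splitting loop can be sketched as a small standalone program. This is a reconstruction from the slide fragments, not a copy of NumericUtils: it works on plain non-negative longs (no sign-flip encoding), returns the ranges as {min, max, shift} triples, and the wrap-around guard and the widening of the upper bound in addRange are assumptions about what the real helper does.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeDemo {
    // Splits [minBound, maxBound] into sub-ranges, each tagged with the shift
    // (precision level) at which it can be matched. Each entry is {min, max, shift}.
    static List<long[]> split(long minBound, long maxBound, int precisionStep, int valSize) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<long[]> ranges = new ArrayList<>();
        if (minBound > maxBound) {
            return ranges;
        }
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;
            boolean hasUpper = (maxBound & mask) != mask;
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            // Assumed guard against the next bounds wrapping around.
            boolean wrapped = nextMin < minBound || nextMax > maxBound;
            if (shift + precisionStep >= valSize || nextMin > nextMax || wrapped) {
                // Lowest precision reached, or no higher level is usable.
                addRange(ranges, minBound, maxBound, shift);
                break;
            }
            if (hasLower) {
                addRange(ranges, minBound, minBound | mask, shift);
            }
            if (hasUpper) {
                addRange(ranges, maxBound & ~mask, maxBound, shift);
            }
            minBound = nextMin;
            maxBound = nextMax;
        }
        return ranges;
    }

    private static void addRange(List<long[]> ranges, long min, long max, int shift) {
        // Widen the upper bound to the last value covered by its prefix.
        max |= (1L << shift) - 1L;
        ranges.add(new long[] {min, max, shift});
    }

    public static void main(String[] args) {
        // Prints: 3..3 @ shift 0, 12..12 @ shift 0, 4..11 @ shift 2
        for (long[] r : split(3, 12, 1, 32)) {
            System.out.println(r[0] + ".." + r[1] + " @ shift " + r[2]);
        }
    }
}
```

For q = price:[3 TO 12] with precisionStep = 1 this yields three sub-ranges instead of ten individual values: 3 and 12 at shift 0, plus the values 4..11 covered by two terms at shift 2.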
Numeric Range Query (How?)
● shift = 0
● diff = 0b00000010 = 2
● mask = 0b00000001 = 1
● hasLower = ((3 & 1) != 0) = true
● hasUpper = ((12 & 1) != 1) = true
○ addRange 3..(3 | 1) = 3..3
○ addRange (12 & ~1)..12 = 12..12
● nextMin = (3 + 2) & ~1 = 4
● nextMax = (12 - 2) & ~1 = 10
[Figure: leaf terms 0..13 (shift = 0)]
Numeric Range Query (How?)
● min: 4; max: 10
● shift = 1
● diff = 0b00000100 = 4
● mask = 0b00000010 = 2
● hasLower = ((4 & 2) != 0) = false
● hasUpper = ((10 & 2) != 2) = false
● nextMin = 4 & ~2 = 4
● nextMax = 10 & ~2 = 8
[Figure: shift = 1 terms 0..6 above leaf terms 0..13]
Numeric Range Query (How?)
● min: 4; max: 8
● shift = 2
● diff = 0b00001000 = 8
● mask = 0b00000100 = 4
● hasLower = ((4 & 4) != 0) = true
● hasUpper = ((8 & 4) != 4) = true
● nextMin = (4 + 8) & ~4 = 8
● nextMax = (8 - 8) & ~4 = 0
● nextMin > nextMax => END: addRange 4..8, i.e. terms 1..2 at shift = 2
[Figure: shift = 2 terms 0..3 above shift = 1 terms 0..6 and leaf terms 0..13]
Numeric Range Query (How?)
TestNumericUtils#testSplitIntRange
assertIntRangeSplit(lower, upper, precisionStep, expectBounds, shifts)
assertIntRangeSplit(3, 12, 1, true,
  Arrays.asList(
    -2147483645, -2147483645, // 3, 3
    -2147483636, -2147483636, // 12, 12
    536870913, 536870914),    // 1, 2 for shift == 2
  Arrays.asList(0, 0, 2)
); // The awkward unsigned int conversions are done in the asserts
Numeric Range Query (How?)
So, NumericRangeTermsEnum generates and remembers
all ranges to match.
Numeric Range Query (How?)
Basically, a TermsEnum is an iterator to seek or step
through terms in some order.
In our case the order is:
● shift = 0: 0, 1, 2, 3, ..., 12, 13
● then shift = 1: 0, 1, 2, 3, 4, 5, 6
● then shift = 2: 0, 1, 2, 3
● ...
Numeric Range Query (How?)
Actually we have a FilteredTermsEnum:
1. Only the terms highlighted in red are accepted by our enumerator
2. If a term is not accepted, we advance:
FilteredTermsEnum#nextSeekTerm(currentTerm)
TermsEnum#seekCeil(termToSeek)
The seek term depends on currentTerm and the
generated ranges.
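The seek-and-accept behaviour can be imitated with a sorted set standing in for the term dictionary. This is only a sketch: the long key that packs shift and prefix together is a hypothetical encoding chosen so that terms sort by shift first, and TreeSet#ceiling plays the role of seekCeil. None of the names below come from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class RangeSeekDemo {
    // Hypothetical term key: shift in the high bits, prefix (value >>> shift)
    // in the low bits, so keys sort by shift first and by prefix second.
    static long key(int shift, long prefix) {
        return ((long) shift << 32) | prefix;
    }

    // Builds a toy term dictionary holding every prefix term of every value.
    static TreeSet<Long> dictionary(int[] values, int precisionStep, int maxShift) {
        TreeSet<Long> terms = new TreeSet<>();
        for (int value : values) {
            for (int shift = 0; shift <= maxShift; shift += precisionStep) {
                terms.add(key(shift, value >>> shift));
            }
        }
        return terms;
    }

    // Visits only terms inside the given [minKey, maxKey] ranges, which must
    // be sorted and non-overlapping; everything in between is skipped.
    static List<Long> acceptedTerms(TreeSet<Long> dictionary, long[][] ranges) {
        List<Long> accepted = new ArrayList<>();
        for (long[] range : ranges) {
            // Analogue of seekCeil: jump to the first term >= the range start.
            Long term = dictionary.ceiling(range[0]);
            while (term != null && term <= range[1]) {
                accepted.add(term);
                term = dictionary.higher(term); // analogue of next()
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        int[] prices = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        TreeSet<Long> dict = dictionary(prices, 1, 3);
        // Sub-ranges for price:[3 TO 12]: 3 and 12 at shift 0, prefixes 1..2 at shift 2.
        long[][] ranges = {
            {key(0, 3), key(0, 3)},
            {key(0, 12), key(0, 12)},
            {key(2, 1), key(2, 2)},
        };
        for (long term : acceptedTerms(dict, ranges)) {
            System.out.println("shift=" + (term >>> 32) + " prefix=" + (term & 0xFFFFFFFFL));
        }
    }
}
```

With the eleven example documents, the term for 12 does not exist, so the seek lands past that range and only three terms are accepted.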
Numeric Range Query (How?)
OK, now we have a TermsEnum for MultiTermQuery,
and the enum is able to seek through only those terms
which match the appropriate sub-ranges.
The question is how to convert a TermsEnum to a
Query!?
Numeric Range Query (How?)
The last trick is query#rewrite() method of
MultiTermQuery (rewrite is always called on query
before performing search):
public final Query rewrite(IndexReader reader) {
return rewriteMethod.rewrite(reader, this);
}

Oh, “rewriteMethod” how interesting… It defines how
the query is rewritten.
Numeric Range Query (How?)
There are plenty of different rewrite methods, but
the most interesting for us are:
● CONSTANT_SCORE_*
○ BOOLEAN_QUERY_REWRITE
○ FILTER_REWRITE
○ AUTO_REWRITE_DEFAULT
Numeric Range Query (How?)
BOOLEAN_QUERY_REWRITE

1. Collect terms (TermCollector) by using
#getTermsEnum(...)
2. For each term create a TermQuery
3. Return a BooleanQuery with all the TermQuery instances as clauses
Numeric Range Query (How?)
FILTER_REWRITE

1. Get termsEnum by using #getTermsEnum(...)
2. Create a FixedBitSet
3. Get a DocsEnum for each term
4. Iterate over the docs and call bitSet.set(docid)
5. Return a ConstantScoreQuery over the filter (bitSet)
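The steps above can be sketched without Lucene by using java.util.BitSet in place of FixedBitSet and a plain map in place of the postings lists. Both substitutions, and all names in the block, are assumptions made for illustration; the point is only that every matching term contributes its documents to one shared bit set.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterRewriteDemo {
    // postings: term -> ids of the documents containing it (a stand-in for
    // the real postings lists). Returns one bit per matching document.
    static BitSet collect(Map<String, int[]> postings, List<String> matchingTerms, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String term : matchingTerms) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term is not present in the index
            }
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("shift0:3", new int[] {2});
        postings.put("shift2:1", new int[] {3, 4, 5, 6});
        postings.put("shift2:2", new int[] {7, 8, 9, 10});
        List<String> terms = Arrays.asList("shift0:3", "shift0:12", "shift2:1", "shift2:2");
        BitSet hits = collect(postings, terms, 11);
        System.out.println(hits.cardinality() + " matching docs: " + hits);
    }
}
```

Because the result is just a set of document ids, no per-term scoring is involved, which fits the constant-score nature of this rewrite.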
Numeric Range Query (How?)
AUTO_REWRITE_DEFAULT
If the number of documents to be visited in the
postings exceeds some percentage of maxDoc()
for the index, then FILTER_REWRITE is used;
otherwise BOOLEAN_QUERY_REWRITE is used.
Agenda:
● ..
● I promised. Precision Step!
● ...
Precision step
So, what is the precision step and how does it affect
performance?
● Defines how many terms are indexed for each value
○ Lower step values mean more precision levels and
consequently more terms in the index
○ indexedTermsPerValue = bitsPerVal / pStep
○ Lower-precision terms are not unique, so the term
dictionary doesn’t grow much; the postings
file, however, does
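A quick way to see the trade-off is to tabulate the formula for a few step values. The sketch below rounds up when the step does not divide the bit width evenly; that rounding is an assumption on top of the formula shown here, based on one term being emitted per shift below the value size.

```java
class TermsPerValueDemo {
    // Number of terms indexed per value: one per shift 0, step, 2*step, ...
    // that is still below the bit width (ceiling division).
    static int termsPerValue(int bitsPerValue, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        return (bitsPerValue + precisionStep - 1) / precisionStep;
    }

    public static void main(String[] args) {
        int[] steps = {1, 4, 8, 16};
        for (int step : steps) {
            System.out.println("int,  precisionStep=" + step + " -> " + termsPerValue(32, step) + " terms");
            System.out.println("long, precisionStep=" + step + " -> " + termsPerValue(64, step) + " terms");
        }
    }
}
```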
Precision step
So, what is the precision step and how does it affect
performance?
● ...
○ A smaller precision step means fewer terms
to match, which speeds up queries
○ But more terms to seek in the index
○ You can index with a lower precision step value
and test search speed using a multiple of the
original step value
○ The ideal step can only be found by testing
Precision step (Results)
According to NumericRangeQuery javadoc:
● Opteron64 machine, Java 1.5, 8 bit precision step
● 500k docs index
● TermRangeQuery in BooleanRewriteMode took
about 30-40 seconds
● TermRangeQuery in FilterRewriteMode took
about 5 seconds
● NumericRangeQuery took < 100ms
Useful links
● http://searchhub.org/2009/05/13/exploringlucene-and-solrs-trierange-capabilities/
● http://www.panfmp.org/
● http://epic.awi.de/17813/1/Sch2007br.pdf
● http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/search/NumericRangeQuery.html
● http://en.wikipedia.org/wiki/Range_tree
● Me: http://plus.google.com/+VadimKirilchuk
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Recently uploaded (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Numeric Range Queries in Lucene and Solr

  • 1. Numeric Range Queries in Lucene and Solr kirilchukvadim@gmail.com
  • 2. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links
  • 3. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links
  • 4. Range Queries: A range query is a type of query that matches all documents where some value is between an upper and lower boundary: Give me: ● Jeans with price from 200 to 300$ ● Car with length from 5 to 10m ● ...
  • 5. Range Queries: In Solr, a range query is as simple as: q = field:[100 TO 200] We will talk about numeric range queries, but you can use range queries for text too: q = field:[A TO Z]
  • 6. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links (relax)
  • 7. Which field type? Which field type to use for “range” fields (let’s stick with int) in schema? ● solr.IntField ● or maybe solr.SortableIntField ● or maybe solr.TrieIntField
  • 8. Which field type? Let’s assume we have: ● 11 documents, id: 1,2,3,..11 ● each doc has a single-valued “int” price field ● each document’s id is the same as its price ● q = *:* "numFound": 11, "docs": [ { "id": 1, "price_field": 1 }, { "id": 2, "price_field": 2 }, ... { "id": 11, "price_field": 11 }]
  • 9. Which field type - solr.IntField q = price_field:[1 TO 10]
  • 10. Which field type - solr.IntField q = price_field:[1 TO 10] "numFound": 2, "start": 0, "docs": [ { "price_field": 1 }, { "price_field": 10 } ] }
  • 11.
  • 12. Which field type - solr.IntField Stores and indexes the text value verbatim and hence doesn't correctly support range queries, since lexicographic ordering isn't the same as numeric ordering: [1,10],11,2,3,4,5,6,7,8,9 Interesting, but “sort by” works fine: a clever comparator knows that the values are ints!
  • 13. Which field type - solr.SortableIntField ● q = price_field:[1 TO 10] ○ "numFound": 10 ● “Sortable”, in fact, refers to the notion of making the numbers have a correctly sorted order. It’s not about “sort by” actually! ● Processed and compared as strings!!! Tricky string encoding: NumberUtils.int2sortableStr(...) ● Deprecated and will be removed in 5.X ● What should I use then?
  • 14. Which field type - solr.TrieIntField ● q = price_field:[1 TO 10] ○ "numFound": 10 ● Recommended as replacement for IntField and SortableIntField in javadoc ● Default for primitive fields in reference schema ● Said to be fast for range queries (actually depends on precision step) ● Tricky and, btw wtf is precision step?
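The IntField result from slides 10 and 12 can be reproduced without Solr. The sketch below is only an illustration under stated assumptions: it uses a plain `TreeSet<String>` as a stand-in for a term dictionary, and the class and method names (`LexicographicRangeDemo`, `stringRange`) are hypothetical, not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class LexicographicRangeDemo {

    // Returns the values 1..maxValue, stored as plain strings, that fall inside
    // the inclusive range [lower, upper] when compared lexicographically.
    static List<String> stringRange(int maxValue, String lower, String upper) {
        // A TreeSet<String> keeps terms in lexicographic order: 1, 10, 11, 2, 3, ...
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 1; i <= maxValue; i++) {
            terms.add(Integer.toString(i));
        }
        return new ArrayList<>(terms.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        // Only "1" and "10" sit between "1" and "10" in string order.
        System.out.println(stringRange(11, "1", "10"));
    }
}
```

With 11 documents and the range [1 TO 10], only the strings "1" and "10" match, which is the same "numFound": 2 the slides show for solr.IntField.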
  • 15. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links
  • 16. Under the hood - Index
  • 17. Under the hood - Index NumericTokenStream is where half of the magic happens! ● precision step = 1 ● value = 11 00000000 00000000 00000000 00001011 ● Let’s see how it will be indexed!
  • 18. Under the hood - Index Field with precisionStep=1
  • 19. Under the hood - Index shift=0: 00001011 = 11; shift=1: 00001010 = 10 (5 << 1); shift=2: 00001000 = 8 (2 << 2); shift=3: 00001000 = 8 (1 << 3); shift=4: 00000000 = 0 (0 << 4); shift=5: 00000000 = 0 (0 << 5); continue…
  • 20. Under the hood - Index How many terms for an integer? 11111111 11111111 11111111 11111111 The algorithm requires indexing 32/precisionStep terms per value. So, for “11” we have 11, 10, 8, 8, 0, 0, 0, 0, 0….0
  • 21. Under the hood - Index Okay! We indexed 32 tokens for the field. (TermDictionary! Postings!) Where is the trick? Stay tuned!
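The indexing walk-through on slides 17 to 21 can be sketched as a few lines of Java. This is a simplified illustration, not Lucene's NumericTokenStream: it only zeroes the low `shift` bits of the value, whereas the real encoding also flips the sign bit and stores the shift inside the term. The names `TrieTermsDemo` and `trieTerms` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TrieTermsDemo {

    // For shift = 0, precisionStep, 2 * precisionStep, ... (while shift < 32),
    // emit the value with its lowest `shift` bits cleared.
    static List<Integer> trieTerms(int value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<Integer> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add((value >>> shift) << shift);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Integer> terms = trieTerms(11, 1);
        System.out.println(terms.size() + " terms, starting with " + terms.subList(0, 6));
    }
}
```

For value 11 and precisionStep = 1 this yields 32 terms beginning 11, 10, 8, 8, 0, 0, matching slide 19; with precisionStep = 8 only 4 terms are produced.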
  • 22. Under the hood - Query
  • 23. Under the hood - Query Sub-classes of FieldType can override #getRangeQuery(...) to provide their own range query implementation. If not, then you will likely have: MultiTermQuery rangeQuery = TermRangeQuery.newStringRange(...) TrieField overrides it. And here comes...
  • 24. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links
  • 25. Numeric Range Query (Decimal) ● Decimal example, precisionStep = ten ● q = price:[423 TO 642]
  • 26. Numeric Range Query (Binary) ● precisionStep = 1 ● q = price:[3 TO 12] [figure: leaf terms 0 to 13]
  • 27. Numeric Range Query (Binary) ● precisionStep = 1 ● q = price:[3 TO 12] [figure: trie with the shift = 1 level (terms 0 to 6) above the leaf terms 0 to 13]
  • 28. ... Numeric Range Query (Binary) ● precisionStep = 1 ● q = price:[3 TO 12] [figure: trie with further upper levels above the leaf terms 0 to 13]
  • 29. Numeric Range Query (Binary) ● precisionStep = 1 ● q = price:[3 TO 12] [figure: trie with further upper levels above the leaf terms 0 to 13]
  • 30. Numeric Range Query (How?) So, the question is: how do we create the query for this algorithm?
  • 31. Numeric Range Query (How?) Let’s come back to TrieField#getRangeQuery(...) There are several options: ● field is multiValued, has docValues and is not indexed ○ super#getRangeQuery ● field has docValues and is not indexed ○ new ConstantScoreQuery(FieldCacheRangeFilter.newIntRange(...)) ● otherwise, ta-da ○ NumericRangeQuery.newIntRange(...)
  • 32. Numeric Range Query (How?) NumericRangeQuery extends MultiTermQuery which is: An abstract Query that matches documents containing a subset of terms provided by a FilteredTermsEnum enumeration. This query cannot be used directly(abstract); you must subclass it and define getTermsEnum(Terms, AttributeSource) to provide a FilteredTermsEnum that iterates through the terms to be matched.
  • 33. Numeric Range Query (How?) Let’s understand how #getTermsEnum works. Returns new NumericRangeTermsEnum(...) The main part is: NumericUtils.splitIntRange(...)
  • 34. Numeric Range Query (How?) The algorithm relies heavily on binary masks: for (int shift=0; noRanges(); shift += precisionStep): diff = 1L << (shift + precisionStep); mask = ((1L << precisionStep) - 1L) << shift; [figure: diff = 2 at shift = 0, upper-level nodes 0 and 1 above leaves 0 to 3] diff is the distance between neighbours on the upper level. mask is used to check whether the current-level node needs a separate lower or upper range (1 and 3 have hasLower; 0 and 2 have hasUpper).
  • 35. Numeric Range Query (How?) hasLower = (minBound & mask) != 0L; hasUpper = (maxBound & mask) != mask; if (hasLower) addRange(builder, valSize, minBound, minBound | mask, shift); if (hasUpper) addRange(builder, valSize, maxBound & ~mask, maxBound, shift);
  • 36. Numeric Range Query (How?) hasLower = (minBound & mask) != 0L; hasUpper = (maxBound & mask) != mask; nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask; nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
  • 37. Numeric Range Query (How?) // If we are in the lowest precision or the next precision is not available. addRange(builder, valSize, minBound, maxBound, shift); // exit the split recursion loop (FOR)
  • 38. Numeric Range Query (How?) ● shift = 0 ● diff = 0b00000010 = 2 ● mask = 0b00000001 = 1 ● hasLower = ((3 & 1) != 0) = true ● hasUpper = ((12 & 1) != 1) = true ○ addRange 3..(3 | 1) = 3..3 ○ addRange (12 & ~1)..12 = 12..12 ● nextMin = (3 + 2) & ~1 = 4 ● nextMax = (12 - 2) & ~1 = 10 [figure: leaf terms 0 to 13]
  • 39. Numeric Range Query (How?) ● min:4; max:10 ● shift = 1 ● diff = 0b00000100 = 4 ● mask = 0b00000010 = 2 ● hasLower = ((4 & 2) != 0) = false ● hasUpper = ((10 & 2) != 2) = false ● nextMin = 4 & ~2 = 4 ● nextMax = 10 & ~2 = 8 (the mask is still cleared when no range is added) [figure: trie with the shift = 1 level above the leaf terms 0 to 13]
  • 40. Numeric Range Query (How?) ● min:4; max:8 ● shift = 2 ● diff = 0b00001000 = 8 ● mask = 0b00000100 = 4 ● hasLower = ((4 & 4) != 0) = true ● hasUpper = ((8 & 4) != 4) = true ● nextMin = (4 + 8) & ~4 = 8 ● nextMax = (8 - 8) & ~4 = 0 ● nextMin > nextMax, so the loop ends with range 1..2 at shift = 2 [figure: trie with the shift = 2 level above the lower levels]
  • 41. Numeric Range Query (How?) TestNumericUtils#testSplitIntRange assertIntRangeSplit(lower, upper, precisionStep, expectBounds, shifts) assertIntRangeSplit(3, 12, 1, true, Arrays.asList( -2147483645,-2147483645, // 3,3 -2147483636,-2147483636, // 12,12 536870913, 536870914), // 1, 2 for shift == 2 Arrays.asList(0, 0, 2) ); // Crappy unsigned int conversions are done in the asserts
  • 42. Numeric Range Query (How?) So, NumericRangeTermsEnum generates and remembers all ranges to match.
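The split loop from slides 34 to 40 can be sketched as self-contained Java. This is a simplified sketch that follows the formulas quoted on the slides, not a copy of NumericUtils.splitIntRange: it assumes non-negative bounds, so the sign-bit handling and overflow checks of the real code are left out, and each range is reported as {shift, minPrefix, maxPrefix} instead of being passed to a builder. `SplitRangeDemo` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitRangeDemo {
    static final int VAL_SIZE = 32; // bits per int value

    // Splits [minBound, maxBound] into sub-ranges. Each result is
    // {shift, minPrefix, maxPrefix}: match all terms indexed at `shift`
    // whose prefix (value >>> shift) lies in [minPrefix, maxPrefix].
    static List<long[]> split(long minBound, long maxBound, int precisionStep) {
        if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >= 1");
        if (minBound < 0 || maxBound > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("sketch handles 0..Integer.MAX_VALUE only");
        }
        List<long[]> ranges = new ArrayList<>();
        if (minBound > maxBound) return ranges;
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;
            boolean hasUpper = (maxBound & mask) != mask;
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            if (shift + precisionStep >= VAL_SIZE || nextMin > nextMax) {
                // Lowest precision reached, or no room left on the next level.
                ranges.add(range(minBound, maxBound, shift));
                break;
            }
            if (hasLower) ranges.add(range(minBound, minBound | mask, shift));
            if (hasUpper) ranges.add(range(maxBound & ~mask, maxBound, shift));
            minBound = nextMin;
            maxBound = nextMax;
        }
        return ranges;
    }

    private static long[] range(long min, long max, int shift) {
        return new long[] {shift, min >>> shift, max >>> shift};
    }

    public static void main(String[] args) {
        for (long[] r : split(3, 12, 1)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

For price:[3 TO 12] with precisionStep = 1 the sketch reports 3..3 and 12..12 at shift 0 plus 1..2 at shift 2, the same three sub-ranges that the TestNumericUtils assertion on slide 41 expects. A range aligned to the trie, such as [0 TO 15], collapses to a single term at shift 4.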
  • 43. Numeric Range Query (How?) Basically, a TermsEnum is an iterator to seek or step through terms in some order. In our case the order is: 0 1 2 3 4 5 6 7 8 9 10 11 12 13, then (shift = 1): 0 1 2 3 4 5 6, then (shift = 2): 0 1 2 3, ...
  • 44. Numeric Range Query (How?) Actually we have FilteredTermsEnum: 1. Only red terms are accepted by our enumerator 2. If term is not accepted we advance: FilteredTermsEnum#nextSeekTerm(currentTerm) TermsEnum#seekCeil(termToSeek) Seek term depends on currentTerm and generated ranges.
  • 45. Numeric Range Query (How?) OK, now we have a TermsEnum for MultiTermQuery, and the enum is able to seek through only those terms which match the appropriate sub-ranges. The question is how to convert a TermsEnum to a Query!?
  • 46. Numeric Range Query (How?) The last trick is query#rewrite() method of MultiTermQuery (rewrite is always called on query before performing search): public final Query rewrite(IndexReader reader) { return rewriteMethod.rewrite(reader, this); } Oh, “rewriteMethod” how interesting… It defines how the query is rewritten.
  • 47. Numeric Range Query (How?) There are plenty of different rewrite methods, but most interesting for us are: ● CONSTANT_SCORE_* ○ BOOLEAN_QUERY_REWRITE ○ FILTER_REWRITE ○ AUTO_REWRITE_DEFAULT
  • 48. Numeric Range Query (How?) BOOLEAN_QUERY_REWRITE 1. Collect terms (TermCollector) by using #getTermsEnum(...) 2. For each term create TermQuery 3. return BooleanQuery with all TermQuery as leafs
  • 49. Numeric Range Query (How?) FILTER_REWRITE 1. Get termsEnum by using #getTermsEnum(...) 2. Create FixedBitSet 3. Get DocsEnum for each term 4. Iterate over docs and bitSet.set(docid); 5. return ConstantScoreQuery over filter (bitSet)
  • 50. Numeric Range Query (How?) AUTO_REWRITE_DEFAULT If the number of documents to be visited in the postings exceeds some percentage of the maxDoc() for the index, then FILTER_REWRITE is used; otherwise BOOLEAN_QUERY_REWRITE is used.
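The FILTER_REWRITE steps from slide 49 can be illustrated with standard-library types only. This is a sketch under loud assumptions: a `Map<String, int[]>` stands in for the postings of an index, `java.util.BitSet` stands in for FixedBitSet, and `FilterRewriteDemo` and `filterRewrite` are hypothetical names rather than Lucene API.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FilterRewriteDemo {

    // postings: term -> ids of the documents containing that term (toy index).
    // matchingTerms: the terms accepted by the filtered terms enumeration.
    static BitSet filterRewrite(Map<String, int[]> postings, List<String> matchingTerms) {
        BitSet docs = new BitSet();
        for (String term : matchingTerms) {
            int[] docIds = postings.get(term);
            if (docIds == null) {
                continue; // term not present in the index
            }
            for (int docId : docIds) {
                docs.set(docId); // same idea as bitSet.set(docid) on the slide
            }
        }
        return docs; // a constant-score query would then match exactly these docs
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("shift0:3", new int[] {3});
        postings.put("shift0:12", new int[] {12});
        postings.put("shift2:1", new int[] {4, 5, 6, 7});
        postings.put("shift2:2", new int[] {8, 9, 10, 11});
        System.out.println(filterRewrite(postings,
                Arrays.asList("shift0:3", "shift0:12", "shift2:1", "shift2:2")));
    }
}
```

Each document is visited once per matching term and no per-term scoring is done, which is why this style suits ranges that touch many documents.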
  • 51. Agenda: ● .. ● I promised. Precision Step! ● ...
  • 52. Precision step So, what is the precision step and how does it affect performance? ● Defines how many terms are indexed for each value ○ Lower step values mean more precisions and consequently more terms in the index ○ indexedTermsPerValue = bitsPerVal / pStep ○ Lower-precision terms are not unique, so the term dictionary doesn’t grow much; however, the postings file does
  • 53. Precision step So, what is the precision step and how does it affect performance? ● ... ○ A smaller precision step means fewer terms to match, which optimizes query speed ○ But more terms to seek in the index ○ You can index with a lower precision step value and test search speed using a multiple of the original step value. ○ The ideal step can only be found by testing
  • 54. Precision step (Results) According to NumericRangeQuery javadoc: ● Opteron64 machine, Java 1.5, 8 bit precision step ● 500k docs index ● TermRangeQuery in BooleanRewriteMode took about 30-40 seconds ● TermRangeQuery in FilterRewriteMode took about 5 seconds ● NumericRangeQuery took < 100ms
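The trade-off on slides 52 and 53 can be put into numbers. The sketch below computes indexedTermsPerValue as given on slide 52 (rounded up when the step does not divide the bit width), plus an upper bound on the number of terms a single range query has to match. The bound is quoted from memory of the NumericRangeQuery javadoc listed in the useful links, so treat it as an assumption to verify there; `PrecisionStepDemo` and its method names are hypothetical.

```java
final class PrecisionStepDemo {

    // Terms indexed per value: bitsPerValue / precisionStep, rounded up.
    static int termsPerValue(int bitsPerValue, int precisionStep) {
        return (bitsPerValue + precisionStep - 1) / precisionStep;
    }

    // Assumed upper bound on terms matched by one range query:
    // (bitsPerValue / precisionStep - 1) * (2^precisionStep - 1) * 2 + (2^precisionStep - 1)
    static long maxQueryTerms(int bitsPerValue, int precisionStep) {
        long perLevel = (1L << precisionStep) - 1;
        long levels = bitsPerValue / precisionStep;
        return (levels - 1) * perLevel * 2 + perLevel;
    }

    public static void main(String[] args) {
        for (int step : new int[] {1, 2, 4, 8}) {
            System.out.println("int, precisionStep=" + step
                    + ": indexed terms per value=" + termsPerValue(32, step)
                    + ", max terms per query=" + maxQueryTerms(32, step));
        }
    }
}
```

A lower step indexes more terms per value but lowers the worst-case number of terms per query, which is the tension the slides describe; the actual sweet spot still has to be measured.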
  • 55. Agenda: ● What is RangeQuery ● Which field type to use for Numerics ● Range stuff under the hood (run!) ● NumericRangeQuery ● Useful links
  • 56. Useful links ● http://searchhub.org/2009/05/13/exploringlucene-and-solrs-trierange-capabilities/ ● http://www.panfmp.org/ ● http://epic.awi.de/17813/1/Sch2007br.pdf ● http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/search/NumericRangeQuery.html ● http://en.wikipedia.org/wiki/Range_tree ● me http://plus.google.com/+VadimKirilchuk