Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries makes Lucene a perfect fit for analytics applications and, for some use cases, even a credible replacement for a primary data store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk, and show how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
5. Why should I learn about Lucene internals?
• Know the cost of the APIs
– to build blazing fast search applications
– don’t commit all the time
– when to use stored fields vs. doc values
– maybe Lucene is not the right tool
• Understand index size
– “oh, term vectors are 1/2 of the index size!”
– “I removed 20% of my documents and the index size hasn’t changed!”
• This is a lot of fun!
6. Indexing
• Make data fast to search
– duplicate data if it helps
– decide how to index based on the queries
• Trade update speed for search speed
– grep vs. full-text indexing
– prefix queries vs. edge n-grams
– phrase queries vs. shingles
• Indexing is fast
– ~220 GB/hour for 4 KB docs!
– http://people.apache.org/~mikemccand/lucenebench/indexing.html
7. Let’s create an index
• Tree structure
– sorted for range queries
– O(log(n)) search

[Diagram: terms — data, index, Lucene, sql, term — organized as a sorted tree, each pointing to the documents that contain it: 0 “Lucene in action”, 1 “Databases”]
9. Another index
• Store terms and documents in arrays
– binary search

term ordinal | term   | postings list
0            | data   | 0,1
1            | index  | 0,1
2            | Lucene | 0
3            | term   | 0
4            | sql    | 1

doc id | document
0      | Lucene in action
1      | Databases
10. Another index
• Store terms and documents in arrays
– binary search

Segment:

term ordinal | terms dict | postings list
0            | data       | 0,1
1            | index      | 0,1
2            | Lucene     | 0
3            | term       | 0
4            | sql        | 1

doc id | document
0      | Lucene in action
1      | Databases
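The arrays above can be sketched in a few lines of Python — a toy model of a segment, not Lucene's actual implementation: a sorted terms array with a parallel postings array, searched with binary search.

```python
# Conceptual sketch of a segment: sorted terms + parallel postings lists.
import bisect

terms =    ["data", "index", "lucene", "sql", "term"]   # sorted terms dict
postings = [[0, 1], [0, 1],  [0],      [1],   [0]]      # doc ids per term
docs = ["Lucene in action", "Databases"]                # doc id -> document

def search(term):
    """Return the doc ids containing `term`, [] if absent (O(log n))."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

print(search("index"))  # [0, 1]
print(search("solr"))   # []
```

Because the terms array is sorted, the same structure also answers range and prefix queries by scanning from the first matching position.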
11. Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)

Segment 1: 0 data → 0; 1 index → 0; 2 Lucene → 0; 3 term → 0; doc 0: “Lucene in action”
Segment 2: 0 data → 0; 1 index → 0; 2 sql → 0; doc 0: “Databases”

Merged segment: 0 data → 0,1; 1 index → 0,1; 2 Lucene → 0; 3 term → 0; 4 sql → 1; doc 0: “Lucene in action”; doc 1: “Databases”
12. Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)

Segment 1: 0 data → 0; 1 index → 0; 2 Lucene → 0; 3 term → 0; doc 0: “Lucene in action”
Segment 2, with doc ids remapped for the merge (local 0 becomes global 1): 0 data → 1; 1 index → 1; 2 sql → 1; doc 1: “Databases”

Merged segment: 0 data → 0,1; 1 index → 0,1; 2 Lucene → 0; 3 term → 0; 4 sql → 1; doc 0: “Lucene in action”; doc 1: “Databases”
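The merge above can be sketched as follows — a simplified Python model (dicts stand in for Lucene's on-disk terms dicts and postings lists), where the second segment's doc ids are shifted by the first segment's doc count.

```python
# Sketch of segment merging: concatenate docs, union the terms dicts,
# remapping the second segment's local doc ids by the first's doc count.
def merge(seg_a, seg_b):
    docs_a, terms_a = seg_a
    docs_b, terms_b = seg_b
    base = len(docs_a)  # seg_b local doc id i becomes global id i + base
    merged = {t: list(ids) for t, ids in terms_a.items()}
    for t, ids in terms_b.items():
        merged.setdefault(t, []).extend(i + base for i in ids)
    # sorted() stands in for the merge sort over two sorted terms dicts
    return docs_a + docs_b, dict(sorted(merged.items()))

seg_a = (["Lucene in action"],
         {"data": [0], "index": [0], "lucene": [0], "term": [0]})
seg_b = (["Databases"],
         {"data": [0], "index": [0], "sql": [0]})
docs, terms = merge(seg_a, seg_b)
# docs: both documents; terms["data"]: [0, 1] — postings lists were merged
```

Because both inputs are already sorted, the real merge is a sequential merge sort: it reads and writes the data in order, which is why merging is disk-friendly.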
13. Deletions?
• Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions

term ordinal | term   | postings list
0            | data   | 0,1
1            | index  | 0,1
2            | Lucene | 0
3            | term   | 0
4            | sql    | 1

doc id | document         | live
0      | Lucene in action | 1
1      | Databases        | 0

live docs: 1 = live, 0 = deleted
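A minimal sketch of this tombstone-style deletion, assuming a live-docs bitset consulted at search time (illustrative Python, not Lucene's implementation):

```python
# Deletion = flip a bit; the segment files themselves are never rewritten.
live = [1, 1]                        # one bit per doc: 1 = live, 0 = deleted
postings = {"data": [0, 1], "sql": [1]}

def delete(doc_id):
    live[doc_id] = 0                 # space is only reclaimed at merge time

def search(term):
    # searches simply skip postings entries whose live bit is off
    return [d for d in postings.get(term, []) if live[d]]

delete(1)                            # "Databases" disappears from results
print(search("data"))  # [0]
print(search("sql"))   # []
```

This is why removing 20% of the documents does not immediately shrink the index: the bytes stay on disk until a merge drops the deleted docs.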
14. Pros/cons
• Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)
16. Index intersection
Postings lists of the two terms being intersected (the numbers 1–9 on the slide mark the order of the leap-frog steps):

red:  1, 2, 10, 11, 20, 30, 50, 100
shoe: 2, 20, 21, 22, 30, 40, 100

Lucene’s postings lists support skipping, which can be used to “leap-frog” between the lists. Many databases just pick the most selective index and ignore the other ones.
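The leap-frog strategy can be sketched like this; `advance` below stands in for the skipping support (in Lucene, `DocIdSetIterator.advance(target)` plays this role, backed by skip data rather than a binary search):

```python
# Leap-frog intersection of two sorted postings lists.
import bisect

def advance(postings, pos, target):
    """Skip to the first position >= target (binary search stands in
    for Lucene's skip lists)."""
    return bisect.bisect_left(postings, target, pos)

def intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])   # leap over everything < b[j]
        else:
            j = advance(b, j, a[i])   # leap over everything < a[i]
    return out

red  = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
print(intersect(red, shoe))  # [2, 20, 30, 100]
```

Both lists drive the skipping, so large non-matching runs in either list are jumped over instead of scanned, unlike the most-selective-index-only strategy.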
17. What else?
• We just covered search
• Lucene does more
– term vectors
– norms
– numeric doc values
– binary doc values
– sorted doc values
– sorted set doc values
18. Term vectors
• Per-document inverted index
• Useful for more-like-this
• Sometimes used for highlighting

Term vectors (one small inverted index per document), shown next to the main inverted index:

doc 0 “Lucene in action”: 0 data; 1 index; 2 Lucene; 3 term
doc 1 “Databases”: 0 data; 1 index; 2 sql

Main inverted index: 0 data → 0,1; 1 index → 0,1; 2 Lucene → 0; 3 term → 0; 4 sql → 1
19. Numeric/binary doc values
• Per-doc and per-field single values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values

doc id | document         | field_a | field_b
0      | Lucene in action | 42      | afc
1      | Databases        | 1       | gce
2      | Solr in action   | 3       | ppy
3      | Java             | 10      | ccn
20. Sorted (set) doc values
• Ordinal-enabled per-doc and per-field values
– sorted: single-valued, useful for sorting
– sorted set: multi-valued, useful for faceting

doc id | document         | ordinals
0      | Lucene in action | 1,2
1      | Databases        | 0
2      | Solr in action   | 0,1,2
3      | Java             | 1

Terms dictionary for this dv field:
0 | distributed
1 | Java
2 | search
21. Faceting
• Compute value counts for docs that match a query
– e.g. category counts on an e-commerce website
• Naive solution
– hash table: value to count
– O(#docs) ordinal lookups
– O(#docs) value lookups
• 2nd solution
– hash table: ord to count
– resolve values in the end
– O(#docs) ordinal lookups
– O(#values) value lookups
Since ordinals are dense, the ord-to-count table can be a simple array.
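The 2nd solution can be sketched as follows (toy Python: `doc_ords` plays the role of sorted-set doc values, and a plain list plays the role of the dense ord-to-count array):

```python
# Facet counting over ordinals: count into a dense array, resolve the
# ordinals to their string values only once, at the end.
def facet_counts(matching_docs, doc_ords, terms_dict):
    counts = [0] * len(terms_dict)      # ord -> count: a simple array
    for doc in matching_docs:           # O(#docs) ordinal lookups
        for ord_ in doc_ords[doc]:
            counts[ord_] += 1
    # O(#values) value lookups, instead of O(#docs) string lookups
    return {terms_dict[o]: c for o, c in enumerate(counts) if c}

# Sorted-set doc values from the previous slide's example
terms_dict = ["distributed", "Java", "search"]
doc_ords = [[1, 2], [0], [0, 1, 2], [1]]
print(facet_counts([0, 1, 2, 3], doc_ords, terms_dict))
# {'distributed': 2, 'Java': 3, 'search': 2}
```

Comparing ints and indexing into an array per hit is far cheaper than hashing a string per hit, which is why the ordinal detour wins.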
22. How can I use these APIs?
• These are the low-level Lucene APIs; everything is built on top of them: searching, faceting, scoring, highlighting, etc.

API                | Useful for                          | Method
Inverted index     | term -> doc ids, positions, offsets | AtomicReader.fields
Stored fields      | summaries of search results         | IndexReader.document
Live docs          | ignoring deleted docs               | AtomicReader.liveDocs
Term vectors       | more like this                      | IndexReader.termVectors
Doc values / norms | sorting/faceting/scoring            | AtomicReader.get*Values
23. Wrap up
• Data duplicated up to 4 times
– not a waste of space!
– easy to manage thanks to immutability
• Stored fields vs. doc values
– optimized for different access patterns
– get many field values for a few docs: stored fields
– get a few field values for many docs: doc values

Stored fields (row-major — at most 1 seek per doc):
0,A 0,B 0,C
1,A 1,B 1,C
2,A 2,B 2,C

Doc values (column-stride — at most 1 seek per doc per field, BUT more disk / filesystem-cache-friendly):
0,A 1,A 2,A
0,B 1,B 2,B
0,C 1,C 2,C
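The access-pattern difference can be illustrated with a toy row-major vs. column-stride layout (illustrative Python; the field names and values are made up):

```python
# Same data, two layouts: stored fields keep all fields of a doc together,
# doc values keep one field across all docs together.
docs = [{"title": "A0", "price": 1},
        {"title": "A1", "price": 2},
        {"title": "A2", "price": 3}]

# Stored fields: rows — fetching every field of one doc reads one region.
stored = [list(d.items()) for d in docs]

# Doc values: columns — fetching one field for every doc reads one region.
doc_values = {f: [d[f] for d in docs] for f in ("title", "price")}

assert dict(stored[1]) == docs[1]        # few docs, many fields: stored
assert doc_values["price"] == [1, 2, 3]  # many docs, one field: doc values
```

Rendering ten search hits wants whole rows; sorting or faceting a million hits on one field wants a whole column — hence the duplication.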
25. Important rules
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– a disk seek on a spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly
26. Codecs
• File formats are codec-dependent
• Default codec tries to get the best speed for little memory
– to trade memory for speed, don’t use RAMDirectory; use MemoryPostingsFormat, MemoryDocValuesFormat, etc.
• Detailed file formats are available in the javadocs
– http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
27. Compression techniques
• Bit packing / vInt encoding
– postings lists
– numeric doc values
• LZ4
– code.google.com/p/lz4
– lightweight compression algorithm
– stored fields, term vectors
• FSTs
– conceptually a Map<String, ?>
– keys share prefixes and suffixes
– terms index
29. 1. Terms index
• Look up the term in the terms index
– in-memory FST storing term prefixes
– gives the offset to look at in the terms dictionary
– can fast-fail if no terms have this prefix

[FST diagram mapping term prefixes to offsets; the outputs on the arcs (a/1, b/2, l/4, y/3) sum along each path]
br = 2
brac = 3
luc = 4
lyr = 7
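How the FST shares outputs can be sketched as a tiny state machine (a hand-rolled Python toy, not Lucene's FST implementation): the offset of a prefix is the sum of the arc outputs along its path, e.g. lyr = l/4 + y/3 = 7.

```python
# Toy FST: (state, label) -> (next_state, output). State numbers are
# illustrative; outputs are pushed as far toward the root as possible,
# so shared prefixes share output.
arcs = {
    (0, "b"): (1, 2), (1, "r"): (2, 0), (2, "a"): (3, 1), (3, "c"): (4, 0),
    (0, "l"): (5, 4), (5, "u"): (6, 0), (6, "c"): (7, 0),
    (5, "y"): (8, 3), (8, "r"): (9, 0),
}

def lookup(prefix):
    """Sum the arc outputs along the path spelled by `prefix`."""
    state, total = 0, 0
    for ch in prefix:
        state, out = arcs[(state, ch)]   # KeyError = fast-fail: no such prefix
        total += out
    return total

print(lookup("br"), lookup("brac"), lookup("luc"), lookup("lyr"))
# 2 3 4 7 — matching the slide's offsets
```

A missing arc is the fast-fail case: if no term starts with the prefix, the walk dies before ever touching the terms dictionary on disk.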
30. 2. Terms dictionary
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the “BlockTree terms dict”
• Read sequentially until the term is found

Looking up “lucene”: jump to the block, then scan it:
[prefix=luc]
a, freq=1, offset=101       (not found)
as, freq=1, offset=149      (not found)
ene, freq=9, offset=205     (found)
ky, freq=7, offset=260
rative, freq=5, offset=323
31. 3. Postings lists
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into blocks of N=128 values
– 3. bit packing per block
– 4. if docs remain, encode them with vInt

Example with N=4:
doc ids: 1, 3, 4, 6, 8, 20, 22, 26, 30, 31
deltas:  1, 2, 1, 2, 2, 12, 2, 4, 4, 1
blocks:  [1,2,1,2] (2 bits per value)  [2,12,2,4] (4 bits per value)  4, 1 (vInt-encoded)
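The first steps of this scheme can be sketched in Python (the actual bit packing is left out; this just delta-encodes, splits into blocks, and computes the bit width each block would be packed with):

```python
# FOR-style preparation of a postings list: delta-encode, split into
# fixed-size blocks, record bits-per-value per block; the remainder
# would be vInt-encoded.
def for_encode(doc_ids, n=4):
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    full = len(deltas) - len(deltas) % n          # values that fill whole blocks
    blocks = [deltas[i:i + n] for i in range(0, full, n)]
    rest = deltas[full:]                          # leftover docs -> vInt
    widths = [max(b).bit_length() for b in blocks]  # bits per value per block
    return blocks, widths, rest

blocks, widths, rest = for_encode([1, 3, 4, 6, 8, 20, 22, 26, 30, 31])
print(blocks, widths, rest)
# [[1, 2, 1, 2], [2, 12, 2, 4]] [2, 4] [4, 1]
```

Deltas are small even when doc ids are large, so each block only needs as many bits as its largest delta — here 2 and 4 bits per value instead of 32.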
32. 4. Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+ KB blocks

[Diagram: the in-memory index maps docId=0 → offset=42, docId=3 → offset=127, docId=4 → offset=199; docs 0–6 are stored sequentially on disk in 16 KB compressed blocks]
33. Query execution
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fit in the filesystem cache
• “Pulse” optimization
– for unique terms (freq=1), postings are inlined in the terms dict
– only 1 disk seek
– will always be used for your primary keys
36. What is happening here?
[Chart: qps as a function of #docs in the index, dropping at two points, 1 and 2]
1 — Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
37. What is happening here?
[Chart: qps as a function of #docs in the index, dropping at two points, 1 and 2]
1 — Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
2 — Terms dict / postings lists not fully in the cache