Solr Search Engine: Optimize Is (Not) Bad for You

Optimize Is (Not) Bad For You
Deep Dive Into The Segment Merge Abyss
Rafał Kuć
Sematext Group, Inc.

Agenda
• Segments – where, what & how
• Writing segments
• Modifying segments
• Segment merging – what, where, how, why
• Force merging
• Force merging & SolrCloud
• Performance considerations
• Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize

Sematext & I
Cloud metrics & logs

Solr Collection Architecture
Zookeeper
SOLR – shard, shard
SOLR – shard, shard
SOLR – shard, shard
SOLR – shard, shard

Solr Shard Architecture
TLOG
Segment Segment Segment Segment

Lucene Segment
Segment Info
Field Names
Stored Field Values
Point Values
Term Dictionary
Term Frequency
Term Proximity
Normalization
Per Document Vals
Live Documents

Inside the Segment – Term Dictionary
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
TERM DOCID
lucene <1>, <2>
revolution <1>, <2>
washington <1>
boston <2>
_1.tim
_1.tip
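
The term-to-document mapping above can be sketched in a few lines of Python. This is only an illustration, assuming lowercase whitespace tokenization of the Title field; the function name is made up for the example, and the real .tim/.tip on-disk encoding is far more involved.

```python
from collections import defaultdict

# The two example documents from the slide (Title field only).
docs = {
    1: "Lucene Revolution Washington",
    2: "Lucene Revolution Boston",
}

def build_term_dictionary(documents):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, title in documents.items():
        for term in title.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

term_dictionary = build_term_dictionary(docs)
for term, doc_ids in term_dictionary.items():
    print(term, doc_ids)
```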

Inside the Segment – Doc Values
DOCID FIELD VALUE
1 Title Lucene Revolution Washington
1 City Washington D.C.
2 Title Lucene Revolution Boston
2 City Boston
_1.dvd
_1.dvm
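
A rough way to picture doc values is as a per-field column keyed by document ID. The sketch below is a toy model under that assumption; `to_columns` is a hypothetical helper, not a Lucene API.

```python
# Row-oriented view, as on the slide: (doc ID, field, value).
rows = [
    (1, "Title", "Lucene Revolution Washington"),
    (1, "City", "Washington D.C."),
    (2, "Title", "Lucene Revolution Boston"),
    (2, "City", "Boston"),
]

def to_columns(rows):
    """Group values per field, keyed by doc ID (a toy column store)."""
    columns = {}
    for doc_id, field, value in rows:
        columns.setdefault(field, {})[doc_id] = value
    return columns

doc_values = to_columns(rows)

# Reading one field for all documents only touches that field's column.
cities_by_doc = [doc_values["City"][doc_id] for doc_id in sorted(doc_values["City"])]
print(cities_by_doc)
```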

Inside the Segment – Stored Fields
DOCID VALUE
1 Title: Lucene Revolution Washington, City: Washington D.C.
2 Title: Lucene Revolution Boston, City: Boston
_1.fdx
_1.fdt

Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm

_2.cfs
_2.cfe

Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene" ],
"awesome" : true
}

{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
apply changes

{
"id" : 3,
"awesome" : true
}
delete old document

{
"id" : 3,
"awesome" : true
}
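
The atomic update flow can be mimicked with a toy append-only index. This is a sketch, not Solr's implementation: the list-plus-deleted-set model and the `atomic_add` helper are assumptions made for illustration, and the final re-index step is implied by the slides rather than captioned on them.

```python
import copy

# Toy index: an append-only list of documents plus a set of deleted positions.
# This only mimics the idea that segments are immutable; it is not Lucene's API.
index = [{"id": "3", "tags": ["lucene"], "awesome": True}]
deleted = set()

def find_live(doc_id):
    """Return (position, document) of the live document with this id."""
    for pos, doc in enumerate(index):
        if pos not in deleted and doc["id"] == doc_id:
            return pos, doc
    raise KeyError(doc_id)

def atomic_add(doc_id, field, values):
    pos, old = find_live(doc_id)              # 1. retrieve document
    new = copy.deepcopy(old)
    new.setdefault(field, []).extend(values)  # 2. apply changes
    deleted.add(pos)                          # 3. delete old document (mark only)
    index.append(new)                         # 4. index the new version
    return new

updated = atomic_add("3", "tags", ["solr"])
print(updated)
```

Note that the old version is only marked as deleted; it keeps occupying space until a merge rewrites the segment.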

Atomic Updates – In Place
Works on numeric, doc values based fields
Fields must be neither indexed nor stored
Doesn’t require delete/index
Supports only the inc and set modifiers
{
"id" : "3",
"views" : {
"inc" : 100
}
}

retrieve document
{
"id" : 3,
"awesome" : true
}

{
"id" : 3,
"awesome" : true,
"views" : 100
}
apply changes

update doc values
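
By contrast, an in-place update only touches the doc values column. The sketch below assumes a plain dictionary standing in for that column; `in_place_update` is a hypothetical name, not a Solr or Lucene API.

```python
# Toy per-field doc values column: doc ID -> numeric value.
# Names here are illustrative only, not Solr or Lucene APIs.
views = {}

def in_place_update(column, doc_id, modifier, amount):
    """Apply an 'inc' or 'set' modifier directly to the column."""
    if modifier == "inc":
        column[doc_id] = column.get(doc_id, 0) + amount
    elif modifier == "set":
        column[doc_id] = amount
    else:
        raise ValueError("in-place updates support only 'inc' and 'set'")
    return column[doc_id]

in_place_update(views, "3", "inc", 100)
print(views)
```

No document is deleted or re-indexed here, which is why this path avoids creating the deleted documents that later have to be merged away.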

Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
Rapid segment changes – worse I/O cache usage

Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>

Taking Control
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />

Taking Control – Default Indexing Throughput
throughput < 5k/sec @ ~14GB

Taking Control – Max Merged Segment Size
Lower: higher indexing throughput – smaller segments
Higher: better search latency (depends) – more merges

Taking Control – Lowering Max Merged Segment Size
throughput < 5k/sec @ ~15.5GB
11% throughput increase

Taking Control – Merge At Once
Lower: better search latency (depends)
Higher: higher indexing throughput

Taking Control – Lowering Merge At Once
8% throughput decrease

Taking Control – Merge At Once Explicit
Controls number of segments merged at once during force merge

Taking Control – Segments Per Tier
Lower value means more merging, but fewer segments
Together with maxMergeAtOnce it can smooth out I/O spikes
For better indexing throughput set maxMergeAtOnce < segmentsPerTier

Taking Control – Combined Together
but look at read difference

Taking Control – Default vs Combined Read/Write
default settings vs combined changes settings

Taking Control – Reclaim Deletes Weight
Controls importance of merging segments with deleted documents
Increase to put priority on merging segments with deleted documents

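Taking Control – No CFS Ratio
Controls which segments use the compound file system: a merged segment larger than this fraction of the total index size is left in non-compound format
To completely disable CFS set to 0.0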

Taking Control – Merge Scheduler
Controls maximum number of concurrent merges
Controls number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
For SSD set maxThreadCount to min(4, #CPUs / 2)
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">4</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>
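
The maxThreadCount rule of thumb can be written down as a small helper. This is a sketch of the slide's guidance only; the function name is hypothetical, integer division is assumed for "#CPUs / 2", and the lower bound of 1 is an added assumption so that a single-core machine still gets one merge thread.

```python
import os

def suggested_max_thread_count(cpu_count, spinning_disk):
    """Rule of thumb from the slide: 1 for spinning drives, min(4, #CPUs / 2) for SSD."""
    if spinning_disk:
        return 1
    # Assumption: never go below one merge thread on very small machines.
    return max(1, min(4, cpu_count // 2))

cpus = os.cpu_count() or 1
print("SSD:", suggested_max_thread_count(cpus, spinning_disk=False))
print("HDD:", suggested_max_thread_count(cpus, spinning_disk=True))
```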

Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
$ curl 'http://solr:8983/solr/lr/update?optimize=true&maxSegments=1&waitFlush=false'
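
For scripting, the force merge request can be assembled programmatically. The sketch below only builds the URL and does not send it; it assumes the host and collection from the slide, uses `maxSegments` (the parameter name Solr's update handler reads for the target segment count), and `optimize_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def optimize_url(base_url, collection, max_segments=1):
    """Build the update URL that asks Solr to force merge a collection."""
    params = {"optimize": "true", "maxSegments": max_segments}
    return f"{base_url}/solr/{collection}/update?{urlencode(params)}"

url = optimize_url("http://solr:8983", "lr", max_segments=1)
print(url)
```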

Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
Reduces number of used files
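
Why a force merge both reduces the segment count and removes deleted documents can be shown with a toy model. The segment layout below (a list of documents plus a set of deleted positions) is an assumption for illustration, not Lucene's data structure.

```python
# Two toy segments; the first still holds an old, deleted version of doc "3".
segments = [
    {"docs": [{"id": "1"}, {"id": "3", "tags": ["lucene"]}], "deleted": {1}},
    {"docs": [{"id": "3", "tags": ["lucene", "solr"]}], "deleted": set()},
]

def force_merge(segments):
    """Rewrite all segments into one, copying only live documents."""
    live = [
        doc
        for segment in segments
        for position, doc in enumerate(segment["docs"])
        if position not in segment["deleted"]
    ]
    return [{"docs": live, "deleted": set()}]

merged = force_merge(segments)
print(len(segments), "segments ->", len(merged))
print(merged[0]["docs"])
```

Every live document is copied into the new segment, which hints at why the operation is expensive on a large index.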

Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
Will cause temporary increase of disk usage (up to 3x)

Force Merge – SolrCloud Performance Example

Force Merge – Legacy
Index on the master server
Force merge on the master server
Replicate after optimize is done
Solr Master (index documents, force merge)
Solr Slave (pull after optimize)
Solr Slave (pull after optimize)
Solr Slave (pull after optimize)

Force Merge – SolrCloud (Solr 7 – pull replicas)
Create collection
Force merge
Solr will do the rest
Nodes: Primary 1, Primary 2, Pull Replica 1, Pull Replica 2

Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Create replicas
Nodes: Primary 1, Primary 2, Replica 1, Replica 2
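
The create, index, force merge, add replicas sequence could be scripted against the Collections API roughly as follows. This sketch only builds the request URLs; the node names, the two-shard layout, and the default shard names `shard1`/`shard2` are assumptions, and a real CREATE call may need further parameters such as a config set.

```python
from urllib.parse import urlencode

BASE = "http://solr:8983/solr/admin/collections"

def collections_api(**params):
    """Build a Collections API URL from keyword parameters."""
    return f"{BASE}?{urlencode(params)}"

steps = [
    # 1. Create the collection on part of the nodes, one replica per shard.
    collections_api(action="CREATE", name="lr", numShards=2,
                    replicationFactor=1,
                    createNodeSet="node1:8983_solr,node2:8983_solr"),
    # 2. Index documents (not shown), then 3. force merge.
    "http://solr:8983/solr/lr/update?"
    + urlencode({"optimize": "true", "maxSegments": 1}),
    # 4. Create replicas on the remaining nodes.
    collections_api(action="ADDREPLICA", collection="lr",
                    shard="shard1", node="node3:8983_solr"),
    collections_api(action="ADDREPLICA", collection="lr",
                    shard="shard2", node="node4:8983_solr"),
]

for step in steps:
    print(step)
```

Adding the replicas last means the expensive merge runs once per shard, and the new replicas copy the already merged index.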

Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapper.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
  <double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>

Pre-sorts data during merge for:
- faster range queries
- faster data retrieval
- possibility of early query termination
- convenient for time based data
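
The early termination benefit can be illustrated with a toy segment. This is a sketch under simplifying assumptions: a single in-memory list stands in for a segment, and the timestamps and helper names are made up for the example.

```python
import heapq

# Toy documents with arbitrary example timestamps.
events = [{"id": i, "timestamp": ts} for i, ts in enumerate([17, 42, 5, 99, 63, 8])]

def top_k_unsorted(docs, k):
    """Without index sorting, every document is inspected to find the k newest."""
    return heapq.nlargest(k, docs, key=lambda d: d["timestamp"])

def top_k_presorted(sorted_docs, k):
    """With 'timestamp desc' order inside the segment, stop after k documents."""
    return sorted_docs[:k]

# What a sorting merge would produce for this segment.
segment_sorted = sorted(events, key=lambda d: d["timestamp"], reverse=True)

newest = top_k_presorted(segment_sorted, 2)
print([d["timestamp"] for d in newest])
```

Both functions return the same documents; the pre-sorted variant simply reads the first k entries instead of scanning everything.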

http://sematext.com/jobs
You love like we do?
You want to work with ?
Want to work with open source?
You want to do fun stuff?

Get in touch
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
Come talk to us
at the booth
