Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
1. October 13-16, 2015 • Austin, TX
2. Solr at Scale for Time-Oriented Data
Brett Hoerner
@bretthoerner
Senior Platform Engineer, Rocana
3.
• Local to Austin, TX
• Have used Solr(Cloud) since 4.0 (2012)
• Not a contributor, just a user
• Work for startups, typically focused on scalability & performance
• Generally (have to) handle operations in addition to development
4. Quick plug
• "Tuning Solr for Logs", Radu Gheorghe's talk at Lucene/Solr Revolution 2014: bit.ly/tuning-solr-for-logs
5. Spredfast
• SaaS social media marketing research tool
• Access to full firehose for multiple networks
• Example SolrCloud collection:
  ~150+ billion documents spanning 1 year
  ~10k writes/second
  ~45-65 fields per document
  ~800 shards
  On 13 machines in EC2
  Engineering + Operations team of 1-2
8. Rocana
• (Ro)ot (Ca)use A(na)lysis for complex IT operations (large datacenters)
• On-premises enterprise software (not SaaS)
• Monitors 10s or 100s of thousands of machines
• Customers care about 1TB/day on the low end
• Hadoop ecosystem
10. Time-Oriented Realtime Search
• Each social post or log line becomes a Solr doc
• Almost always sort on time field (not TF-IDF)
• Queries almost always include facets
• Queries always include a time range
  "last 30 minutes"
  "last 30 days"
  "December 2014"
11. Time-Oriented Realtime Search
• Typically part of a larger stream processing system
• Kafka, or something like it, is recommended
[Architecture diagram: several firehoses feed Kafka; a pool of Solr indexers consumes from Kafka and writes into the SolrCloud cluster (many nodes); an S3 writer also archives the raw stream to S3]
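As a rough illustration of the "Solr indexer" boxes in the diagram above, here is a minimal Java sketch assuming Kafka's Java consumer and SolrJ's CloudSolrClient; the topic name "firehose", the field names, and the batch thresholds are illustrative, not from the talk. Note it batches by count and time, which the next slide calls for:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("group.id", "solr-indexer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("firehose"));

        // SolrJ 5.x-era constructor; takes the ZooKeeper connect string
        CloudSolrClient solr = new CloudSolrClient("zk1:2181/solr");
        solr.setDefaultCollection("my_collection");

        List<SolrInputDocument> batch = new ArrayList<>();
        long lastFlush = System.currentTimeMillis();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(500);
            for (ConsumerRecord<String, String> r : records) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", r.key());
                doc.addField("text", r.value());
                batch.add(doc);
            }
            // Flush by count and by time, never one document per request;
            // rely on server-side auto(Soft)Commit instead of client commits.
            if (batch.size() >= 1000 || System.currentTimeMillis() - lastFlush > 5000) {
                if (!batch.isEmpty()) {
                    solr.add(batch); // one update request for the whole batch
                    batch.clear();
                }
                lastFlush = System.currentTimeMillis();
            }
        }
    }
}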
12. Optimizing indexing
• Adjust... (see the solrconfig.xml sketch below)
  JVM heap (up to ~30GB)
  ramBufferSizeMB (up to ~512MB)
  solr.autoCommit.maxTime (multiple minutes, with openSearcher=false)
  solr.autoSoftCommit.maxTime (as high as possible)
  mergeFactor
• Batch writes! (by count and time; see the indexer sketch above)
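A sketch of how those knobs look in solrconfig.xml; the concrete numbers are illustrative defaults within the ranges above, and the JVM heap is set at startup (e.g. SOLR_HEAP in solr.in.sh), not here:

<!-- solrconfig.xml (sketch) -->
<indexConfig>
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <!-- mergeFactor is the older 4.x/5.x-era knob; mergePolicy settings supersede it -->
  <mergeFactor>10</mergeFactor>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush to disk regularly, but don't open a new searcher -->
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit controls visibility; set as high as your freshness SLA allows -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
  </autoSoftCommit>
</updateHandler>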
13. Optimizing queries
• DocValues on any field you sort/facet on (see the config sketch below)
• Warm on the most common sort (time)
• Small filterCache, used only for the time range:
  fq=ts:[1444755392 TO 1444841789]
  q=text:happy+birthday
• OR at least cache the time range and the text clause separately:
  fq=ts:[1444755392 TO 1444841789]
  fq=text:happy+birthday
  q=*:*
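A sketch of what that looks like in config, assuming the time field is the "ts" long field from the queries above (the exact field type name depends on your schema):

<!-- schema.xml (sketch): docValues on the field you sort/facet on -->
<field name="ts" type="tlong" indexed="true" stored="true" docValues="true"/>

<!-- solrconfig.xml (sketch): keep the filterCache small, and warm the common time sort -->
<filterCache class="solr.FastLRUCache" size="64" initialSize="64" autowarmCount="16"/>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">ts desc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>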
14. Sharding by time
• By default, Solr hashes the unique field* of each document to decide which shard it belongs on
  * uniqueKey in schema.xml
• The effect is that documents are spread evenly across *all* shards
15. Sharding by time
• This means every shard is actively writing and merging new segments all the time
• Your write rate per shard is total docs/sec divided by the shard count, which spreads writes pretty thin if you're thinking of using, say, 500 shards
16. Sharding by time
• Even worse, on the read side this means *every* query must be sent to *every* shard (unless you're looking up a document by its unique field, which is a pretty poor use case for Solr...)
• Given 1 query and 500 shards:
  q=text:happy+timestamp:[37 TO 286]&sort=timestamp desc&rows=100
  sends 500 requests out
  searches/sorts your *entire* data set
  waits for 500 responses
  merges them
  and finally responds
17. Sharding by time
• The solution is to take full control of document routing (see the write-side sketch below):
  /admin/collections?
    action=CREATE
    &name=my_collection
    &router.name=implicit
    &shards=1444780800,1444867200,1444953600,...
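With router.name=implicit, Solr no longer picks a shard for you: the indexer must name the target shard itself, via the _route_ parameter (or a router.field set at creation time). A minimal SolrJ sketch, assuming the shard names are the day-aligned midnight-UTC epochs from the CREATE call above:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class TimeRoutedWriter {
    // Shard names above are midnight-UTC epochs, one per day (86400s apart)
    static long shardFor(long tsSeconds) {
        return (tsSeconds / 86400) * 86400;
    }

    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181/solr"); // SolrJ 5.x-era constructor
        client.setDefaultCollection("my_collection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "post-12345");
        doc.addField("created_at", 1444874953L);
        doc.addField("text", "hello");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        // 1444874953 floors to 1444867200, i.e. the 2015-10-15 shard
        req.setParam("_route_", String.valueOf(shardFor(1444874953L)));
        req.process(client);
        client.close();
    }
}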
19. Sharding by time
[Diagram: my_collection with day-aligned shards named 1444780800, 1444867200, 1444953600, ...]
/solr/my_collection/select?
  q=text:hello
  &fq=created_at:[1444874953 TO 1444989225]
  &shards=1444867200,1444953600
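Computing the &shards= list for a query like the one above is the same day arithmetic in reverse; a small sketch, assuming the day-aligned shard names from the CREATE call earlier:

static String shardsFor(long fromTs, long toTs) {
    StringBuilder shards = new StringBuilder();
    // first day bucket that overlaps the range, then step one day at a time
    for (long day = (fromTs / 86400) * 86400; day <= toTs; day += 86400) {
        if (shards.length() > 0) shards.append(',');
        shards.append(day);
    }
    return shards.toString();
}

// shardsFor(1444874953L, 1444989225L) -> "1444867200,1444953600",
// matching the query above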
20. Cluster "layering"
• Duplicate cluster that only holds more recent data
• ... but with more hardware per document
[Diagram: one cluster holding 12 months of data, a second "layer" holding 30 days of data; a query for "last June" goes to the 12-month cluster, a query for "last hour" goes to the 30-day cluster]
21. Hacks
• bit.ly/created-at-hack
• If we can make assumptions about what's in each shard, we can optimize the "sub" queries that are sent to each node
• Also optionally disable facet refinement
22. Replication
• Solr on HDFS is one interesting option
  Can recover existing distributed indexes on another node (using the *same* directory!); see "autoAddReplicas" in the Collection API CREATE call (example below)
• "Normal" replication was historically an issue (for us) at scale
• Apparently made 100% faster in Solr 5.2
• Remember that replicas aren't backups
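For the HDFS case, autoAddReplicas is just a flag on the same Collection API CREATE call shown earlier (it only applies when indexes live on shared storage like HDFS):

/admin/collections?
  action=CREATE
  &name=my_collection
  &router.name=implicit
  &shards=1444780800,1444867200,...
  &autoAddReplicas=true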
23.
• So, you have your >100 billion document cluster running...
• Indexes are slowly built over the course of months/years by ingesting realtime data...
24.
• But what if...
  We need to add new fields (to old docs)
  We need to remove unused fields
  We need to change fields (type, content)
  We decide we need to query further into the past
  We have catastrophic data loss
  We want to upgrade Solr (with no risk)
25. Timebomb
• Let's say:
  We index 5k docs/sec for a year
  That means 157,680,000,000 documents
  Say the cluster can ingest 50k docs/sec max
  It'd take 36.5 days to reindex that year
  ... for any/every change
  ... if nothing went wrong for 36.5 days
  ... and you need to write the code to do it
26. MapReduceIndexerTool
• Hadoop to the rescue (?)
• Under Solr contrib: github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
• Given raw input data*, run a MapReduce job that generates Solr indexes (locally!) (invocation sketch below)
  * this is one good reason to use something like Kafka and push all your raw data to HDFS/S3/etc. in addition to Solr
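A sketch of an invocation, based on the tool's documented flags; the jar path, morphline file, and HDFS paths are placeholders:

hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode/tmp/mrit-out \
  --zk-host zk1:2181/solr \
  --collection my_collection \
  --go-live \
  hdfs://namenode/raw/events/2015/*

--go-live merges the freshly built indexes into the live HDFS-backed collection, which is the "GoLive" option mentioned on the next slide; without it you must ship and load the index directories yourself.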
27. MapReduceIndexerTool
• Amazon ElasticMapReduce works well for this
  Plus, you can use spot instances (cheap!)
• The trick is that you have to load the completed indexes yourself
  At that point it becomes an Ops problem; some kind of orchestration like Chef comes in handy here, but it's not done for you or open source (yet?)
• Unless you run Solr on HDFS (GoLive)
28. MapReduceIndexerTool
• ~150 billion document collection spanning 1 year, reindexed from scratch and running on a new cluster in ~6 days for ~$3k
  Bug/bribe Adam McElwee to open source it :) twitter.com/txlord
29. Conclusion
• Optimize like you would any Solr cluster
• Reduce caching; RAM is probably scarce and hit rates are probably low
• Shard based on time
• Be prepared to rebuild the entire collection so you can iterate on product/design