2. Near Real Time Indexing: Building Real Time Search Index For E-Commerce
Umesh Prasad, Tech Lead @ Flipkart
Thejus V M, Data Architect @ Flipkart
3. Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
6. Traffic @ Flipkart
• Peak Traffic
– ~800K active users
– ~160K requests per second
• Search Traffic
– ~40K searches per second (Service)
– ~10K searches per second (Solr)
• Latency
– Median: 11 ms
– 99th percentile: 1.1 seconds
7. Search @ Flipkart
• Catalogue
– ~50 main categories
– ~5000 sub-categories
– ~231 million documents
– ~90 million SKUs
– ~160 million listings
• E-commerce Marketplace
– ~100K Sellers
– Local Sellers
– Regional Availability
– Logistics Constraints
8. E-commerce Search
• Heavy usage of drill-down filters
• Heavy usage of faceting
• Only top results matter
• Results grouped/collapsed by products
• Serviceability and delivery experience MATTER
9. Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
13. Product/Listing: Important Attributes
[Diagram: Product (aka SKU) and its Listings, with attributes supplied by: Seller Rating Service, Catalogue Service, Promise Service, Availability Service, Offer Service, Pricing Service]
15. Out Of Stock, but Why Show?
• Index has stale availability data
• 234K products
16. Challenge 1: High Update Rates
Signal            updates/sec (normal)  updates/sec (peak)  updates/hr
text / catalogue  ~10                   ~100                ~100K
pricing           ~100                  ~1K                 ~10 million
availability      ~100                  ~10K                ~10 million
offer             ~100                  ~10K                ~10 million
seller rating     ~10                   ~1K                 ~1 million
signal 6          ~10                   ~100                ~1 million
signal 7          ~100                  ~10K                ~10 million
signal 8          ~100                  ~10K                ~10 million
18. Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
19. SolrCloud for NRT
[Diagram: the Ingestion pipeline sends a batch of documents to the Shard Leader. Labels at the Shard Leader: Update Log (for document versioning), Auto commit, Soft commit, Forward to Replica. Six Shard Replicas are shown, each labelled "Re-open searcher".]
20. SolrCloud Evaluation
• Update = Delete + Add
– Block Join Index: update whole block (Product + Listings)
• Updated document gets streamed to all replicas in sync
– Reduces indexing throughput
• Soft commit is Not Free
– Soft commit In Memory Segment
– Lots of merges
– Huge document churn / deletes
– All caches still need to be re-generated
– Filter cache misses especially hurt performance
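Why a single-field change is costly under "Update = Delete + Add" can be seen in a toy append-only index. This is a simplified sketch, not Solr or Lucene code: `ToyIndex` and its methods are hypothetical names, and a block is modelled as a product document followed by its listing documents.

```python
class ToyIndex:
    """Append-only document store with tombstones, loosely modelled on segments."""

    def __init__(self):
        self.docs = []        # every document version ever written
        self.deleted = set()  # positions of tombstoned documents
        self.blocks = {}      # product ID -> positions of its current block

    def update_block(self, product_id, product, listings):
        """Replace a whole block: tombstone the old docs, append the new ones."""
        for pos in self.blocks.get(product_id, []):
            self.deleted.add(pos)
        start = len(self.docs)
        self.docs.append(product)
        self.docs.extend(listings)
        self.blocks[product_id] = list(range(start, len(self.docs)))

    def live_count(self):
        """Number of documents that are not tombstoned."""
        return len(self.docs) - len(self.deleted)


index = ToyIndex()
listings = [{"seller": "s1", "price": 45000}, {"seller": "s2", "price": 44000}]
index.update_block("ProductA", {"brand": "Apple"}, listings)

# Changing one listing's price still rewrites the product and both listings.
listings[1] = {"seller": "s2", "price": 43000}
index.update_block("ProductA", {"brand": "Apple"}, listings)
```

After the second call the toy index holds six stored documents, three of them tombstoned, which is the kind of churn that later has to be cleaned up by merges.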
21. Agenda
• Search @ Flipkart
• Need for Real Time Index
• SolrCloud Solution
• Our approach
• Q & A
22. [Diagram: Lucene Index, one Lucene Segment]
Documents
ProductA  brand : Apple    availability : T  price : 45000
ProductB  brand : Samsung  availability : T  price : 23000
ProductC  brand : Apple    availability : F  price : 5000
Document ID Mappings
0 ProductA
1 ProductB
2 ProductC
Posting List (Inverted Index): Terms, Sparse Bitsets
brand : Apple     0, 2
brand : Samsung   1
availability : T  0, 1
DocValues (columnar data)
Price  45000  23000  5000
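The segment layout on this slide can be mimicked with a few plain Python structures. This is only a toy sketch, not Lucene's actual format: `build_segment` and its field names are hypothetical, and Python sets stand in for the sparse bitsets.

```python
from collections import defaultdict

# The three example products from the slide.
PRODUCTS = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]


def build_segment(products):
    """Build a toy segment: doc ID mapping, posting lists, and a price column."""
    doc_ids = {}                 # internal doc ID -> product ID
    postings = defaultdict(set)  # (field, value) term -> set of doc IDs
    price_column = []            # columnar data, indexed by doc ID
    for doc_id, (product_id, fields) in enumerate(products):
        doc_ids[doc_id] = product_id
        postings[("brand", fields["brand"])].add(doc_id)
        postings[("availability", fields["availability"])].add(doc_id)
        price_column.append(fields["price"])
    return doc_ids, dict(postings), price_column


doc_ids, postings, price_column = build_segment(PRODUCTS)

# Matching "brand:Apple AND availability:T" is an intersection of posting lists.
matches = postings[("brand", "Apple")] & postings[("availability", "T")]
```

The posting lists answer "which documents match", while the price column answers "what is the value for this document", which is the split between inverted and forward access that the rest of the talk relies on.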
23. A Typical Search Flow
[Diagram labels]
• Stages: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Other Components, Results
• Data: Lucene Segment with Posting List (Inverted Index) and Doc Values (Forward Index); NRT Store
• Example query: samsung mobiles, Offer : exchange offer, price desc
• After rewrite: category : mobiles, brand : samsung, Offer : exchange offer
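The flow can be sketched end to end on a toy index. Everything below is a hypothetical illustration: `rewrite`, `search`, the hard-coded postings, and the `nrt_price` dictionary standing in for an NRT store are assumptions, not the Solr components named on the slide.

```python
# Toy inverted index: term -> doc IDs (used for matching).
POSTINGS = {
    ("category", "mobiles"): {0, 1, 2, 3},
    ("brand", "samsung"): {0, 2, 3},
    ("offer", "exchange offer"): {0, 3},
}
# Toy forward index: doc ID -> price as of the last commit.
PRICE_DOC_VALUES = [23000, 45000, 15000, 30000]


def rewrite(query):
    """Hypothetical rewrite: map the free-text query to structured terms."""
    rewrites = {"samsung mobiles": [("category", "mobiles"), ("brand", "samsung")]}
    return rewrites[query]


def search(query, extra_filters, nrt_price=None):
    """Match on postings, then sort by price (descending) via forward lookups."""
    nrt_price = nrt_price or {}
    terms = rewrite(query) + extra_filters
    matched = set.intersection(*(POSTINGS[t] for t in terms))

    def price(doc_id):
        # Prefer the fresher NRT value when one exists for the document.
        return nrt_price.get(doc_id, PRICE_DOC_VALUES[doc_id])

    ranked = sorted(matched, key=price, reverse=True)
    facets = {"brand:samsung": len(matched & POSTINGS[("brand", "samsung")])}
    return ranked, facets


filters = [("offer", "exchange offer")]
committed_order, facets = search("samsung mobiles", filters)
fresh_order, _ = search("samsung mobiles", filters, nrt_price={0: 35000})
```

In this sketch a price change that has not reached the committed forward index reorders the results only when the NRT lookup is consulted.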
24. NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile: ~10K matches
– 99th percentile: ~1 million matches
● Data on Java heap
– Memory efficiency
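One way to reason about both considerations is a forward index held in a packed primitive array indexed by doc ID, so that each lookup is a constant-time read and each value costs a fixed number of bytes. The sketch below is an assumption made for illustration, since the slides do not describe the actual data structure; `NrtForwardIndex` is a hypothetical name, and Python's `array` module stands in for a primitive array on the Java heap.

```python
from array import array


class NrtForwardIndex:
    """Forward index for one numeric signal, stored as a packed integer array."""

    def __init__(self, num_docs, default=0):
        # 'l' packs signed C longs contiguously instead of one object per value.
        self.values = array("l", [default] * num_docs)

    def update(self, doc_id, value):
        """Apply a real-time update in place; no commit or reopen is involved."""
        self.values[doc_id] = value

    def lookup(self, doc_id):
        """Constant-time read of the current value for a document."""
        return self.values[doc_id]

    def bytes_used(self):
        """Size of the packed payload in bytes."""
        return self.values.itemsize * len(self.values)


prices = NrtForwardIndex(num_docs=1_000_000)
prices.update(42, 45000)

# A query with ~1 million matches costs one array read per matched document.
total = sum(prices.lookup(doc_id) for doc_id in range(1_000_000))
```

The per-match cost matters because, at the 99th percentile quoted above, a single query performs on the order of a million such lookups.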
30. Near Real Time Solr Architecture
[Diagram labels]
• Sources: Catalogue, Pricing, Availability, Offers, Seller Quality, Others
• Kafka, Ingestion pipeline
• Lucene Updates: Solr Master; Commit + Replicate + Reopen; Lucene
• NRT Updates: NRT Forward Index, NRT Inverted store; Redis Bootstrap
• Solr: Ranking, Matching, Faceting
31. Accomplishments
• Real time sorting
• Real time filtering: PostFilter
– Higher latency
• Near real time filtering: cached DocIdSet
– No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
– Consistent 99% performance
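The trade-off between the two filtering modes can be illustrated with a toy availability signal. This is a hedged sketch rather than the Solr `PostFilter` API or the authors' implementation: `post_filter`, `CachedDocIdSet`, and the `availability` dictionary are hypothetical stand-ins.

```python
availability = {0: True, 1: True, 2: False}  # toy NRT store: doc ID -> in stock


def post_filter(matched, store):
    """Real-time filtering: consult the live store for every matched doc."""
    return [doc_id for doc_id in matched if store[doc_id]]


class CachedDocIdSet:
    """Near-real-time filtering: a precomputed set, refreshed only on demand."""

    def __init__(self, store):
        self.refresh(store)

    def refresh(self, store):
        """Rebuild the cached set from the current state of the store."""
        self.doc_ids = {doc_id for doc_id, ok in store.items() if ok}

    def filter(self, matched):
        """Cheap membership test against the (possibly stale) cached set."""
        return [doc_id for doc_id in matched if doc_id in self.doc_ids]


matched = [0, 1, 2]
cached = CachedDocIdSet(availability)

availability[1] = False  # a real-time update arrives after the cache was built

live_result = post_filter(matched, availability)  # sees the update
stale_result = cached.filter(matched)             # still includes doc 1
cached.refresh(availability)
refreshed_result = cached.filter(matched)
```

The stale result corresponds to the "no consistency between lookup and filtering" case: a lookup against the live store already sees doc 1 as unavailable while the cached filter still lets it through until the next refresh.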
32. Accomplishments @ Flipkart
● Real time consumption for ~150 signals
● Reduction in shown out of stock products by 2X
● Production instances handling ~50K real time updates/second