Imagine the frustration of a user who finds the perfect item while browsing, only to discover on clicking it that it is out of stock, that the price has changed, or that it cannot be delivered to their location. This happens when the search index does not have real-time availability, price and seller information, and it is a core challenge that an e-commerce marketplace search engine has to solve. Regular document search index technologies (like Solr/Lucene) have trouble with attributes that are in constant flux (like availability and price), which are typically seller/listing-specific. In this talk, we present these challenges and our solutions for a customized e-commerce search index that addresses them.
6. Out Of Stock, but Why Show?
Index has Stale Availability Data
234K Products
7. Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing
8. Challenge 1 : Update rates
                    updates / sec        max updates / hr
                    min      max
text / catalogue    ~10      ~100        ~100K
pricing             ~100     ~1K         ~10 million
availability        ~100     ~10K        ~10 million
offer               ~100     ~10K        ~10 million
seller rating       ~10      ~1K         ~1 million
signal 6            ~10      ~100        ~1 million
signal 7            ~100     ~10K        ~10 million
signal 8            ~100     ~10K        ~10 million
9. Challenge 2 : Lucene Index Update
● Lucene doesn’t support Partial Updates.
● Update = Delete Old Doc + Add New Document
– Recreate the entire document for every update
– Not friendly with multiple micro-services with different update rates
● Problem Compounded By MarketPlace
● Product + All Its Listings == SINGLE BLOCK
● BLOCK structure chosen for query performance (~100X better latencies)
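The cost of "Update = Delete + Add" on a block can be sketched with a toy in-memory index. This is only an illustration under stated assumptions: `BlockIndex`, `add_block` and `update_listing_field` are hypothetical names, not Lucene or Solr APIs, and the listing ids and prices are made up.

```python
# Toy stand-in for a block-structured index (hypothetical, not Lucene's API).
class BlockIndex:
    def __init__(self):
        self.blocks = {}       # product_id -> [product doc, listing docs...]
        self.docs_written = 0  # counts every document written or rewritten

    def add_block(self, product_id, product_doc, listing_docs):
        block = [dict(product_doc)] + [dict(l) for l in listing_docs]
        self.blocks[product_id] = block
        self.docs_written += len(block)

    def update_listing_field(self, product_id, listing_id, field, value):
        # No partial update: delete the old block, patch one field,
        # then re-add the product and every one of its listings.
        old = self.blocks.pop(product_id)
        product_doc, listings = old[0], old[1:]
        for listing in listings:
            if listing["listing_id"] == listing_id:
                listing[field] = value
        self.add_block(product_id, product_doc, listings)


index = BlockIndex()
index.add_block(
    "ProductA",
    {"brand": "Apple"},
    [
        {"listing_id": "l1", "price": 45000},
        {"listing_id": "l2", "price": 45500},
        {"listing_id": "l3", "price": 44000},
    ],
)
# One price change on one listing rewrites all four documents in the block.
index.update_listing_field("ProductA", "l2", "price", 44900)
print(index.docs_written)
```

The counter makes the write amplification visible: the more listings a product has, the more documents a single field change rewrites.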
10. Challenge 3 : Refresh Cycle
[Diagram: the ingestion pipeline sends a batch of documents to the Solr master, which commits (fsync); the index is replicated to six Solr slaves, each of which opens a new index.]
11. Lucene Index
Documents
ProductA    brand : Apple      availability : T    price : 45000
ProductB    brand : Samsung    availability : F    price : 23000
ProductC    brand : Apple      availability : T    price : 5000
Lucene Segment
Document ID Mappings
0 ProductA
1 ProductB
2 ProductC
Posting List (Inverted Index) : Terms → Sparse Bitsets
availability : T    0 , 2
brand : Samsung     1
brand : Apple       0 , 2
DocValues (columnar data)
Price    45000 23000 5000
12. Root Cause : Updating Data Structures
POSTING LIST (Millions of Terms)
Term 1 → BitSet 1
Term 2 → BitSet 2
Term 3 → BitSet 3
…
Document (Thousands of Terms)
Term1 Term2 Term3 Term4 …
Posting List / Bit Set (Millions of Documents)
                                      Updatable ?
D  : 0 1 0 0 0 0 1 0 0 0 0 0 0 1      Yes
S  : 2,7,14                           May Be
SE : 2,5,7                            NO
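The three posting-list forms on this slide can be reproduced in a few lines. The sketch below reads D, S and SE as a dense bitset, a sparse list of 1-based positions, and a delta-encoded sparse list; that reading is an interpretation of the slide's numbers, and the helper names are made up for illustration.

```python
import bisect


def dense_to_sparse(bits):
    # 1-based positions of the set bits
    return [i + 1 for i, b in enumerate(bits) if b]


def delta_encode(sorted_ids):
    # Store each id as the gap from the previous id
    out, prev = [], 0
    for doc_id in sorted_ids:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def delta_decode(deltas):
    out, total = [], 0
    for gap in deltas:
        total += gap
        out.append(total)
    return out


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
sparse = dense_to_sparse(dense)
encoded = delta_encode(sparse)

# Dense form: adding document 10 flips a single bit in place.
dense[10 - 1] = 1

# Delta-encoded form: decode, insert in sorted order, re-encode.
# Every gap after the insertion point changes.
ids = delta_decode(encoded)
bisect.insort(ids, 10)
re_encoded = delta_encode(ids)
```

The dense form is trivially updatable but wastes space over millions of documents, while the compact delta-encoded form has to be rewritten on any change, which matches the Yes / May Be / NO column.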
13. Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing
14. A Typical Search Flow
[Diagram labels: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Results; Lucene Segment with Posting List (Inverted Index) and Doc Values (Forward Index); NRT Store; Other Components.]
15. NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
● Hook it to Lucene
22. A Typical Search Flow
[Diagram labels: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Results; Lucene Index with Posting List (Inverted Index), Doc Values (Forward Index) and Schema; NRT Store with Schema; Other Components.]
23. Lucene Index
Terms Dictionary and Posting List (inverted index)
term ord    term                  posting list
0           availability:true     0,2
1           availability:false    1
0           brand:adidas          0,1
1           brand:nike            2
1           price:230             1
2           price:250             0
Doc Value (Forward index) : term ords per docId
field           0    1    2
price           2    1    3
brand           0    0    1
availability    0    1    0
Documents
docId    External ID    Brand     Availability    Price
0        ProductA       Adidas    True            250
1        ProductB       Adidas    False           230
2        ProductC       Nike      True            500
● Lucene Index = Multiple Mini Indexes aka Segments
● Lucene Segment
○ Write Once → Immutable Data structures
○ Posting List (Sparse encoded bitsets)
○ Doc Values (Columnar Data structures)
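The two structures on this slide can be built from the three example documents as follows. This is a simplified sketch: it stores raw values in the columns rather than term ordinals, and the `match` and `value` helpers are illustrative names, not Lucene calls.

```python
from collections import defaultdict

docs = [
    {"id": "ProductA", "brand": "adidas", "availability": "true", "price": 250},
    {"id": "ProductB", "brand": "adidas", "availability": "false", "price": 230},
    {"id": "ProductC", "brand": "nike", "availability": "true", "price": 500},
]
fields = ["availability", "brand", "price"]

# Inverted index: "field:value" -> ascending list of internal doc ids
postings = defaultdict(list)
for doc_id, doc in enumerate(docs):
    for field in fields:
        postings[f"{field}:{doc[field]}"].append(doc_id)

# Forward index: one column per field, addressed by internal doc id
doc_values = {field: [doc[field] for doc in docs] for field in fields}


def match(term):
    # Matching walks the inverted index: term -> documents
    return postings.get(term, [])


def value(field, doc_id):
    # Ranking, faceting and stats read the forward index: document -> value
    return doc_values[field][doc_id]
```

Matching answers "which documents contain this term", while the forward index answers "what is this document's value", which is why fast-changing fields used for ranking fit naturally in a column.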
25. C5 : Lucene in-place update
● Only numeric / byte Array fields
● Updates still go through the entire refresh cycle
● Not exposed via Solr
26. Forward Index - API Hook
● Lucene API Hook
– ValueSource
● Input
– Lucene Internal Document Id
– Field Name
● Output
– Field Value
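The input/output contract above (internal document id and field name in, field value out) can be sketched as a store that lives outside the index and is updated on its own schedule. `NrtForwardIndex` and `rank` are hypothetical names for illustration; a real hook would sit behind Lucene's ValueSource rather than a plain Python class.

```python
class NrtForwardIndex:
    """Hypothetical store: field name -> column addressed by internal doc id."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_field(self, field, default):
        self.columns[field] = [default] * self.num_docs

    def update(self, field, doc_id, value):
        # Applied directly: no commit, replication or segment rewrite
        self.columns[field][doc_id] = value

    def value(self, field, doc_id):
        return self.columns[field][doc_id]


def rank(matches, store):
    # Illustrative ranking only: in-stock documents first, then cheaper first
    return sorted(
        matches,
        key=lambda d: (not store.value("availability", d), store.value("price", d)),
    )


store = NrtForwardIndex(num_docs=3)
store.add_field("availability", False)
store.add_field("price", 0)
for doc_id, (available, price) in enumerate([(True, 250), (False, 230), (True, 500)]):
    store.update("availability", doc_id, available)
    store.update("price", doc_id, price)

before = rank([0, 1, 2], store)
store.update("availability", 1, True)  # document 1 comes back in stock
after = rank([0, 1, 2], store)
```

The second ranking reflects the availability change immediately, without touching the index that produced the matches.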
27. NRT Store - Inverted Index
● Input
– Lucene Segment
– query
• Field Name : Field Value
• offer : o1
● Output
– DocSet (posting list)
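A minimal sketch of this lookup, assuming the store keeps forward data per segment and inverts it at regular intervals so that each doc set follows the segment's internal document order. All class and method names here are hypothetical, and the offer ids are example values.

```python
from collections import defaultdict


def invert(column):
    """column[doc_id] holds the list of values for that document."""
    inverted = defaultdict(list)
    for doc_id, values in enumerate(column):  # ascending ids keep doc sets sorted
        for v in values:
            inverted[v].append(doc_id)
    return dict(inverted)


class NrtInvertedIndex:
    def __init__(self):
        self.forward = {}   # (segment, field) -> column
        self.inverted = {}  # (segment, field) -> {value: [doc ids]}

    def load(self, segment, field, column):
        self.forward[(segment, field)] = column

    def update(self, segment, field, doc_id, values):
        self.forward[(segment, field)][doc_id] = values

    def refresh(self):
        # Meant to run periodically; rebuilds the doc sets from forward data
        self.inverted = {key: invert(col) for key, col in self.forward.items()}

    def doc_set(self, segment, field, value):
        return self.inverted.get((segment, field), {}).get(value, [])


idx = NrtInvertedIndex()
idx.load("segment_0", "offer", [["o1"], [], ["o1", "o2"]])
idx.refresh()
first = idx.doc_set("segment_0", "offer", "o1")

idx.update("segment_0", "offer", 1, ["o1"])
stale = idx.doc_set("segment_0", "offer", "o1")   # unchanged until next refresh
idx.refresh()
fresh = idx.doc_set("segment_0", "offer", "o1")
```

The trade-off in this sketch is that filter results lag updates by at most one refresh interval, in exchange for never rewriting the Lucene segment.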
Editor's Notes
Going from a Page 1 to Page could be a matter of seconds on Sales Day ( Big Billion Day)
Hierarchical documents ( Product → Listing )
Highly structured
Free Text, Numeric, Tags
Micro services for individual field updates
Different update rates
Independently updating fields
Availability has been used in ranking, but it is stale, hence out-of-stock (OOS) products are shown. Explain the challenge of 234K.
Means, the entire index will be recreated every hour
Product Documents + Seller SKU Documents
block-join index
block : Composite document, with product and all its seller SKU
Con
Any Update = Delete + Recreate entire block
Aggravates Delete + Recreate problem
Remove animation, don’t spend too much time on it.
Posting =
Keep the fast changing data outside of the index
Update this data independent of Solr updates
Hooks in Lucene/Solr for retrieval
ValueSource
Filter
Collector
Explain the API Hook
Lucene APIs : internal document id
Columnar data structures
Implementation dependent on data type
Chosen for memory efficiency
boolean : 1bit
enum : log(#enumerations) bits
int : 4 bytes
multi val : array of the above data structures
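The per-type sizes in these notes (1 bit per boolean, log(#enumerations) bits per enum) can be sketched as a fixed-width bit-packed column. `PackedColumn` is a hypothetical helper written for illustration; the talk's store keeps data on the Java heap, and this Python version only shows the layout idea.

```python
class PackedColumn:
    """Fixed-width bit-packed column backed by a bytearray (hypothetical helper)."""

    def __init__(self, num_docs, num_distinct_values):
        # Bits needed to represent ordinals 0 .. num_distinct_values - 1
        self.bits = max(1, (num_distinct_values - 1).bit_length())
        self.data = bytearray((num_docs * self.bits + 7) // 8)

    def set(self, doc_id, ordinal):
        base = doc_id * self.bits
        for i in range(self.bits):
            byte, bit = divmod(base + i, 8)
            if (ordinal >> i) & 1:
                self.data[byte] |= 1 << bit
            else:
                self.data[byte] &= ~(1 << bit) & 0xFF

    def get(self, doc_id):
        base = doc_id * self.bits
        ordinal = 0
        for i in range(self.bits):
            byte, bit = divmod(base + i, 8)
            ordinal |= ((self.data[byte] >> bit) & 1) << i
        return ordinal


availability = PackedColumn(num_docs=1000, num_distinct_values=2)  # 1 bit each
condition = PackedColumn(num_docs=1000, num_distinct_values=5)     # 3 bits each

availability.set(42, 1)
condition.set(7, 4)
```

With these sizes, a boolean column for 1000 documents takes 125 bytes and a five-value enum column takes 375 bytes, compared with 4000 bytes for a plain 4-byte int column.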
Filter API of lucene
DocIdSet getDocIdSet(LuceneIndex)
Invert the data at regular intervals so that it adheres to Lucene’s internal document order