
Slash n near real time indexing

Imagine the frustration of users who find the perfect item while browsing, only to discover after clicking it that it is out of stock, that the price has changed, or that it cannot be delivered to their location. This happens when the search index lacks real-time availability, price, and seller information, so keeping that data fresh is a core challenge for an e-commerce marketplace search engine. Conventional document search index technologies (such as Solr/Lucene) have trouble with attributes that are in constant flux (such as availability and price), which are typically seller- or listing-specific. In this talk, we present these challenges and our solution: a customized search index for e-commerce.



  1. A real-time search index for e-commerce. Umesh Prasad, Thejus V M
  2. Oh!! Out of Stock
  3. Damn!! Out of Stock
  4. Damn!! Missed the Offer
  5. E-commerce Index Attributes
     [Diagram labels: PRODUCT (aka SKU), LISTING, catalogue service, Promise Engine, Availability Service, Seller Rating, Offer Engine, Pricing Engine]
  6. Out of Stock, but Why Show?
     ● Index has stale availability data
     ● 234K products
  7. Outline
     ❏ E-commerce search challenge
     ❏ Challenges in keeping an inverted index updated
     ❏ Our approach to near real time indexing
  8. Challenge 1: Update rates
     Signal            Updates/sec (min)  Updates/sec (max)  Max updates/hr
     text / catalogue  ~10                ~100               ~100K
     pricing           ~100               ~1K                ~10 million
     availability      ~100               ~10K               ~10 million
     offer             ~100               ~10K               ~10 million
     seller rating     ~10                ~1K                ~1 million
     signal 6          ~10                ~100               ~1 million
     signal 7          ~100               ~10K               ~10 million
     signal 8          ~100               ~10K               ~10 million
  9. Challenge 2: Lucene Index Update
     ● Lucene doesn't support partial updates.
     ● Update = delete old document + add new document
       – Recreate the entire document for every update
       – Not friendly to multiple micro-services with different update rates
     ● Problem compounded by the marketplace
     ● Product + all its listings == SINGLE BLOCK
     ● BLOCK structure chosen for query performance (~100X better latencies)
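
The cost described on slide 9 can be illustrated with a toy, append-only index. This is only a sketch under stated assumptions: `ToyIndex`, its methods, and the sample product and listings are hypothetical and are not Lucene or Solr APIs. It shows that changing one listing's price rewrites the product and every listing in its block.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical append-only index: documents are never modified in place."""
    docs: list = field(default_factory=list)    # every document ever written
    deleted: set = field(default_factory=set)   # doc IDs marked as deleted
    writes: int = 0                             # total documents written

    def add_block(self, block):
        """Append a product and all its listings as one contiguous block."""
        start = len(self.docs)
        self.docs.extend(block)
        self.writes += len(block)
        return range(start, len(self.docs))

    def update_block(self, old_ids, new_block):
        """Update = mark the old block deleted + re-add the whole block."""
        self.deleted.update(old_ids)
        return self.add_block(new_block)

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]


index = ToyIndex()
product = {"type": "product", "id": "P1", "brand": "Apple"}
listings = [{"type": "listing", "id": f"L{i}", "price": 45000 + i} for i in range(3)]
ids = index.add_block(listings + [product])

# Only one listing's price changes, yet all four documents are written again.
listings[1] = {**listings[1], "price": 44000}
new_ids = index.update_block(ids, listings + [product])
```

After the single price change, `index.writes` is 8 rather than 5, which is the write amplification the slide attributes to the block structure.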
  10. Challenge 3: Refresh Cycle
      [Diagram: the ingestion pipeline sends a batch of documents to the Solr master (commit, fsync); the index is replicated to six Solr slaves, each of which opens a new index.]
  11. Example documents
      ● ProductA: brand: Apple, availability: T, price: 45000
      ● ProductB: brand: Samsung, availability: F, price: 23000
      ● ProductC: brand: Apple, availability: T, price: 5000
      Lucene segment (part of a Lucene index)
      ● Document ID mappings: 0 ProductA, 1 ProductB, 2 ProductC
      ● Posting list (inverted index), terms → sparse bitsets:
        – availability : T → 0, 2
        – brand : Samsung → 1
        – brand : Apple → 0, 2
      ● DocValues (columnar data): price → 45000, 23000, 5000
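
The two per-segment structures on slide 11 can be derived from the three example products with a few lines of Python. This is a minimal sketch: `build_segment` and its arguments are hypothetical names, and real Lucene segments use compressed on-disk formats rather than Python lists and dicts.

```python
from collections import defaultdict

# The three example documents from the slide.
products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(products, inverted_fields, column_fields):
    """Build doc ID mappings, posting lists, and columnar values in one pass."""
    doc_ids = {}                                  # doc ID -> external product ID
    postings = defaultdict(list)                  # "field:value" -> sorted doc IDs
    doc_values = {f: [] for f in column_fields}   # field -> column indexed by doc ID
    for doc_id, (external_id, fields) in enumerate(products):
        doc_ids[doc_id] = external_id
        for f in inverted_fields:
            postings[f"{f}:{fields[f]}"].append(doc_id)
        for f in column_fields:
            doc_values[f].append(fields[f])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    products, inverted_fields=["brand", "availability"], column_fields=["price"]
)
```

Here `postings["brand:Apple"]` holds doc IDs 0 and 2, and `doc_values["price"]` is the column 45000, 23000, 5000, matching the slide.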
  12. Root Cause: Updating Data Structures
      ● Document: thousands of terms (Term1, Term2, Term3, Term4, ...)
      ● Posting list: millions of terms (Term 1 → BitSet 1, Term 2 → BitSet 2, Term 3 → BitSet 3, ...), each over millions of documents
      ● Posting list / bit set representations. Updatable?
        – D: 0 1 0 0 0 0 1 0 0 0 0 0 0 1 → Yes
        – S: 2,7,14 → May be
        – SE: 2,5,7 → No
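
The three representations on slide 12 appear to be a dense bitset (D), a sparse list of positions (S), and a gap-encoded sparse list (SE). That reading is an assumption, as is treating bit positions as 1-based, but it makes the slide's numbers line up. The sketch below, with hypothetical helper names, converts between the forms and shows why the encoded form is the hardest to update: adding one document means decoding and re-encoding the list.

```python
def dense_to_sparse(bits):
    """Dense bitset -> sorted 1-based positions of the set bits."""
    return [i for i, b in enumerate(bits, start=1) if b]


def delta_encode(positions):
    """Sorted positions -> gaps between consecutive positions."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def delta_decode(gaps):
    """Gaps -> sorted positions (running sum)."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def add_to_encoded(gaps, pos):
    """Adding one document to the encoded form: decode, insert, re-encode."""
    positions = delta_decode(gaps)
    if pos not in positions:
        positions.append(pos)
        positions.sort()
    return delta_encode(positions)


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # D from the slide
sparse = dense_to_sparse(dense)                      # S
encoded = delta_encode(sparse)                       # SE

dense[4 - 1] = 1                       # dense form: flip one bit in place
encoded_after = add_to_encoded(encoded, 4)   # encoded form: rewrite the list
```

With the slide's bitset, `sparse` is 2, 7, 14 and `encoded` is 2, 5, 7.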
  13. Outline
      ❏ E-commerce search challenge
      ❏ Challenges in keeping an inverted index updated
      ❏ Our approach to near real time indexing
  14. A Typical Search Flow
      [Diagram labels: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Results; Lucene Segment containing Posting List (Inverted Index), Doc Values (Forward Index), and Other Components; NRT Store]
  15. NRT Forward Index - Considerations
      ● Lookup efficiency
        – 50th percentile: ~10K matches
        – 99th percentile: ~1 million matches
      ● Data on Java heap
        – Memory efficiency
      ● Hook it to Lucene
  16. NRT Store - Forward Index: Naive NRT Forward Index
      ● Lucene segment (doc ID, product): 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      ● Naive NRT forward index (ProductId, Availability, Price): ProductA, True, 100; ProductB, False, 150; ProductC, False, 200; ProductD, True, 250
      ● Lookup engine: DocId: 3, field: price → ProductId(3) → <ProductC, price> → 200
      ● Latency: ~10 secs for ~1 million lookups
  17. NRT Store - Forward Index: Optimized NRT Forward Index (Segment Independent)
      ● Lucene segment (doc ID, product): 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      ● NRT IDs: 0 ProductA, 1 ProductC, 2 ProductD, 3 ProductB
      ● DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
      ● Availability: T F F T
      ● Price: 100 200 250 150
      ● Lookup engine: DocId: 3, field: price → NrtId(3) → 2 → Price(2) → 200
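
The difference between the naive and the optimized forward index on slides 16 and 17 can be sketched as follows. The product order, NRT IDs, and price column follow the slide, read with zero-based array indexing; all function and variable names are hypothetical, and the real store lives on the Java heap rather than in Python arrays. The point is that the optimized path replaces a per-document string hash lookup with two integer array reads.

```python
from array import array

segment_docs = ["ProductB", "ProductA", "ProductC", "ProductD"]  # Lucene doc ID -> product
nrt_products = ["ProductA", "ProductC", "ProductD", "ProductB"]  # NRT ID -> product
prices = array("i", [100, 200, 250, 150])                        # price column by NRT ID

# Naive: doc ID -> product ID string -> hash lookup -> field value.
naive_store = {p: {"price": prices[i]} for i, p in enumerate(nrt_products)}


def naive_price(doc_id):
    return naive_store[segment_docs[doc_id]]["price"]


# Optimized: resolve doc ID -> NRT ID once per segment, then index arrays.
nrt_id_of = {p: i for i, p in enumerate(nrt_products)}
doc_to_nrt = array("i", (nrt_id_of[p] for p in segment_docs))


def fast_price(doc_id):
    return prices[doc_to_nrt[doc_id]]


def update_price(product, new_price):
    """An update overwrites one array slot; the Lucene segment is untouched."""
    prices[nrt_id_of[product]] = new_price


price_before = fast_price(2)      # doc 2 is ProductC
update_price("ProductC", 180)
price_after = fast_price(2)
```

Because the value columns are keyed by NRT ID rather than by Lucene doc ID, only the small `doc_to_nrt` mapping has to be rebuilt when a new segment is opened.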
  18. NRT Store - Inverted Index
      ● Flow: NRT Forward Store → NRT Inverter → NRT Invert Store
      ● Lucene segment (doc ID, product): 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      ● NRT invert store: Availability : T → 0, 3; Offer : O1 → 2, 3
      ● Query Availability:T → matching BitSet
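
The inverter on slide 18 can be sketched as a scan over a segment's documents that tests each one against the forward store and sets a bit per match. The forward-store values below are assumptions chosen to reproduce the slide's doc IDs, and `invert` and `matching_doc_ids` are hypothetical names. A Python integer stands in for the bitset.

```python
def invert(segment_docs, forward_store, field, value):
    """Return a bitset (int) with bit d set when Lucene doc d has field == value."""
    bits = 0
    for doc_id, product in enumerate(segment_docs):
        if forward_store[product].get(field) == value:
            bits |= 1 << doc_id
    return bits


def matching_doc_ids(bits):
    """Expand a bitset into the sorted list of doc IDs it contains."""
    return [d for d in range(bits.bit_length()) if (bits >> d) & 1]


segment_docs = ["ProductB", "ProductA", "ProductC", "ProductD"]  # doc ID -> product
forward_store = {   # assumed values, picked to match the slide's posting lists
    "ProductA": {"availability": "F", "offer": None},
    "ProductB": {"availability": "T", "offer": None},
    "ProductC": {"availability": "F", "offer": "O1"},
    "ProductD": {"availability": "T", "offer": "O1"},
}

available = invert(segment_docs, forward_store, "availability", "T")
on_offer = invert(segment_docs, forward_store, "offer", "O1")
both = available & on_offer   # bitsets combine with plain bitwise operations
```

Since the bitset is expressed in the segment's own doc IDs, it can be intersected with other per-segment filters during matching.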
  19. Near Real Time Solr Architecture
      [Architecture diagram labels: update sources (Catalogue, Pricing, Availability, Offers, Seller Quality, Others); Text Updates; NRT Updates; Ingestion pipeline; Kafka; Redis; Bootstrap; Solr Master; Commit + Replicate + Reopen; Solr; Lucene; NRT Forward Index; NRT Inverted store; Ranking, Matching, Faceting]
  20. Accomplishments
      ● Real-time consumption for ranking signals
      ● BBD saw up to ~30K updates/second
      ● Query latency comparable to DocValues
        – Consistent 99% performance
  21. Thank you & Questions
  22. A Typical Search Flow
      [Diagram labels: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Results; Lucene Index containing Posting List (Inverted Index), Doc Values (Forward Index), Schema, and Other Components; NRT Store with Schema]
  23. Lucene Index
      Documents:
        docId  External ID  Brand   Availability  Price
        0      ProductA     Adidas  True          250
        1      ProductB     Adidas  False         230
        2      ProductC     Nike    True          500
      Terms dictionary and posting list (inverted index):
        ord  term                posting list
        0    availability:true   0,2
        1    availability:false  1
        0    brand:adidas        0,1
        1    brand:nike          2
        1    price:230           1
        2    price:250           0
      Doc values (forward index), by docId:
        field         0  1  2
        price         2  2  3
        brand         0  0  1
        availability  0  1  0
      ● Lucene Index = multiple mini indexes, aka segments
      ● Lucene segment
        ○ Write once → immutable data structures
        ○ Posting lists (sparse encoded bitsets)
        ○ Doc values (columnar data structures)
  25. C5: Lucene in-place update
      ● Only numeric / byte array fields
      ● Updates go through the entire refresh cycle
      ● Not exposed via Solr
  26. Forward Index - API Hook
      ● Lucene API hook
        – ValueSource
      ● Input
        – Lucene internal document ID
        – Field name
      ● Output
        – Field value
  27. NRT Store - Inverted Index
      ● Input
        – Lucene segment
        – Query
          • Field name : field value
          • offer : o1
      ● Output
        – DocSet (posting list)
