OCTOBER 11-14, 2016 • BOSTON, MA

Near Real Time Indexing
Building Real Time Search Index For E-Commerce

Umesh Prasad
Tech Lead @ Flipkart

Thejus V M
Data Architect @ Flipkart

Agenda
•  Search @ Flipkart
•  Need for Real Time Search
•  SolrCloud Solution
•  Our approach
•  Q & A

Traffic @ Flipkart
•  Peak Traffic
–  ~800K active users
–  ~160K requests per second
•  Search Traffic
–  ~40K searches per second (Service)
–  ~10K searches per second (Solr)
•  Latency
–  Median: 11 ms
–  99th percentile: 1.1 second

Search @ Flipkart
•  Catalogue
–  ~50 main categories
–  ~5000 sub-categories
–  ~231 million documents
–  ~90 million SKUs
–  ~160 million listings
•  E-commerce Marketplace
–  ~100K Sellers
–  Local Sellers
–  Regional Availability
–  Logistics Constraints

E-commerce Search
•  Heavy usage of drill down filters
•  Heavy usage of faceting
•  Only top results matter
•  Results grouped/collapsed by products
•  Serviceability and delivery experience MATTERS

Sorry, Stock Over !!?

Damn !! Is Offer Over ??

What !! All Steal Deals Gone ??

Product / Listing: Important Attributes
[Diagram labels: Product aka SKU; Listings; Seller Rating Service; Catalogue Service; Promise Service; Availability Service; Offer Service; Pricing Service]

Summary: Lucene Document
•  Product/SKU (Parent Document)
–  Listing (Child Document)
•  Query: Mostly SKU Attributes (Free Text)
•  Filters: SKU + Listing Attributes (Drill Down)
•  Ranking: SKU + Listing Attributes (Explicit/Relevance)
•  Index Time Join aka Block Join (Best Performance)

Out Of Stock, but Why Show?
Index has Stale Availability Data
234K Products

Challenge 1: High Update Rates

Signal            updates/sec (normal)  updates/sec (peak)  updates/hr
text / catalogue  ~10                   ~100                ~100K
pricing           ~100                  ~1K                 ~10 million
availability      ~100                  ~10K                ~10 million
offer             ~100                  ~10K                ~10 million
seller rating     ~10                   ~1K                 ~1 million
signal 6          ~10                   ~100                ~1 million
signal 7          ~100                  ~10K                ~10 million
signal 8          ~100                  ~10K                ~10 million

Challenge 2: Micro Services
[Diagram labels: Catalogue, Pricing, Availability, Offers, ...; Updates Stream 1, Updates Stream 2, Updates Stream 3; Ingestion pipeline; Change Propagation; Document Builder; Documents {L1,L2 … P1}; Solr/Lucene]
●  Lucene doesn’t support Partial Updates
●  Update = Delete + Add

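To make "Update = Delete + Add" concrete for a block of one product and its listings, here is a minimal Python sketch. It is a toy in-memory model, not Lucene or Solr code: `Block`, `ToyBlockIndex`, and the sample ids `P1`, `L1`, `L2` are hypothetical names chosen for illustration. It only shows that changing one field of one listing rewrites every document in the block.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One product (parent) and its listings (children), indexed together."""
    product: dict
    listings: list = field(default_factory=list)


class ToyBlockIndex:
    """Append-only toy index: documents are never modified in place."""

    def __init__(self):
        self.docs = []        # flat list of documents, children before their parent
        self.deleted = set()  # positions marked as deleted
        self.blocks = {}      # product id -> positions of its current block

    def add_block(self, block):
        start = len(self.docs)
        self.docs.extend(block.listings)
        self.docs.append(block.product)
        self.blocks[block.product["id"]] = range(start, len(self.docs))

    def update_listing(self, product_id, listing_id, changes):
        """Change one listing: mark the whole old block deleted, add a new block."""
        old = self.blocks[product_id]
        old_docs = [dict(self.docs[i]) for i in old]  # copies of the old documents
        self.deleted.update(old)
        *listings, product = old_docs
        for listing in listings:
            if listing["id"] == listing_id:
                listing.update(changes)
        self.add_block(Block(product, listings))
        return len(old_docs)  # documents rewritten for a one-field change


index = ToyBlockIndex()
index.add_block(Block({"id": "P1"}, [{"id": "L1", "price": 100},
                                     {"id": "L2", "price": 120}]))
rewritten = index.update_listing("P1", "L2", {"price": 110})
print(rewritten, sorted(index.deleted))
```

In this toy, one price change on `L2` rewrites three documents and leaves three deleted ones behind to be cleaned up later, which is the cost the slide points at.
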
SolrCloud for NRT
[Diagram labels: Ingestion pipeline; Batch of documents; Shard Leader; Update Log (For Document Versioning); Forward to Replica; Auto commit; Soft Commit; six Shard Replicas, each with "Re-open searcher"]

SolrCloud Evaluation
•  Update = Delete + Add
–  Block Join Index → Update Whole Block (Product + Listings)
•  Updated Document gets streamed to all replicas in sync
–  Reduces indexing throughput
•  Soft commit is Not Free
–  Soft commit → In Memory Segment
–  Lots of Merges
–  Huge document churn / deletes
–  All caches still need to be re-generated
–  Filter Cache miss specially hurts performance

Lucene Index
Documents:
ProductA: brand : Apple, availability : T, price : 45000
ProductB: brand : Samsung, availability : T, price : 23000
ProductC: brand : Apple, availability : F, price : 5000
Lucene Segment:
Document ID Mappings: 0 ProductA, 1 ProductB, 2 ProductC
Posting List (Inverted Index), Terms → Sparse Bitsets:
brand : Apple → 0, 2
brand : Samsung → 1
availability : T → 0, 1
DocValues (columnar data): Price → 45000, 23000, 5000

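The three structures on this slide can be rebuilt from the three sample products in a few lines of Python. This is a minimal sketch for illustration, not how Lucene stores a segment: plain dicts and lists stand in for the posting lists and DocValues, and `search` is a hypothetical helper.

```python
products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]

doc_ids = {}       # document id mapping: doc id -> product key
postings = {}      # inverted index: "field:value" -> ascending doc ids
price_values = []  # columnar data: price of each doc, indexed by doc id

for doc_id, (key, fields) in enumerate(products):
    doc_ids[doc_id] = key
    for name in ("brand", "availability"):
        postings.setdefault(f"{name}:{fields[name]}", []).append(doc_id)
    price_values.append(fields["price"])


def search(*terms):
    """Intersect posting lists (matching), then sort by price descending (ranking)."""
    result = set(postings.get(terms[0], []))
    for term in terms[1:]:
        result &= set(postings.get(term, []))
    return sorted(result, key=lambda d: price_values[d], reverse=True)


available_apple = search("brand:Apple", "availability:T")
print([doc_ids[d] for d in available_apple])
```

Matching only touches the posting lists, while sorting reads the price column by doc id. That split is what makes it possible to move fast-changing fields such as price out of the segment.
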
A Typical Search Flow
[Diagram labels: Query; Query Rewrite; Matching; Ranking; Faceting; Stats; Other Components; Results; Lucene Segment with Posting List (Inverted Index) and Doc Values (Forward Index); NRT Store]
Example query: samsung mobiles, Offer : exchange offer, price desc
After rewrite: category : mobiles, brand : samsung, Offer : exchange offer

NRT Forward Index - Considerations
●  Lookup efficiency
–  50th percentile: ~10K matches
–  99th percentile: ~1 million matches
●  Data on Java heap
–  Memory efficiency

NRT Forward Index - Naive Implementation
Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
Lookup Engine: DocId : 3, field : price → ProductId(3) → ProductD → <ProductD,price> → 250
NRT Forward Index:
ProductId  Availability  Price
ProductA   True          100
ProductB   False         150
ProductC   False         200
ProductD   True          250
Latency: ~10 secs for ~1 Million lookups

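A minimal Python sketch of the naive lookup path, assuming the forward index is a hash map keyed by product id. The names `segment_doc_to_product`, `nrt_forward_index`, and `naive_lookup` are hypothetical, and the sample values follow the order in which they appear on the slide. In a real system the first step would read the product id out of the Lucene segment, which the dict below only stands in for.

```python
# Segment-local doc id -> product id.
segment_doc_to_product = {0: "ProductB", 1: "ProductA", 2: "ProductC", 3: "ProductD"}

# NRT forward index keyed by product id (string keys, one record per product).
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    """Two keyed lookups per matched document: doc id -> product id -> field."""
    product_id = segment_doc_to_product[doc_id]
    return nrt_forward_index[product_id][field]


print(naive_lookup(3, "price"))
```

Each matched document pays for a string key fetch and a hash lookup. At around a million matches per query that per-document cost adds up to the latency shown on the slide.
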
NRT Store - Forward Index Optimized
Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
Lookup Engine: DocId : 3, Field : price → NrtId(3) → 2 → Price(2) → 250
NRT Forward Index (Segment Independent):
NrtId:         0         1         2         3
Product:       ProductA  ProductC  ProductD  ProductB
Price:         100       200       250       150
Availability:  T         F         F         T
Status:        01        10        01        00
Latency: ~100 ms for ~1 Million lookups

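The optimized layout replaces keyed lookups with two array reads. The Python sketch below uses the sample values from the slide. The function names are hypothetical, and Python's `array` module only stands in for primitive arrays on the Java heap.

```python
from array import array

# Per-segment mapping: position = Lucene doc id, value = NRT id.
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent columns: position = NRT id.
prices = array("i", [100, 200, 250, 150])
availability = [True, False, False, True]


def lookup_price(doc_id):
    """Two array reads: doc id -> NRT id -> value."""
    return prices[doc_to_nrt[doc_id]]


def update_price(nrt_id, new_price):
    """An update overwrites one slot; the segment and the mapping stay untouched."""
    prices[nrt_id] = new_price


print(lookup_price(3))
```

Because the columns are indexed by NRT id rather than by doc id, only the small `doc_to_nrt` mapping depends on the segment; the data columns do not change when segments do.
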
NRT Store Filter - PostFilter
PostFilter(Price:[100 TO 150])
Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
NRT Filter: DocId : 3 → NrtId(3) → 2 → Price(2) → Don’t Delegate
NRT Forward Index (Segment Independent):
NrtId:         0         1         2         3
Product:       ProductA  ProductC  ProductD  ProductB
Price:         100       200       250       150
Availability:  T         F         F         T
Status:        01        10        01        00

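The "Don't Delegate" decision can be sketched as a collector wrapper in Python. The shape loosely mirrors Solr's `PostFilter` and `DelegatingCollector`, but this is not Solr code: `NrtPriceFilter` is a hypothetical class, and the data is the sample from the slide.

```python
class NrtPriceFilter:
    """Passes a doc id on to the delegate only if its live price is in range."""

    def __init__(self, doc_to_nrt, prices, low, high, delegate):
        self.doc_to_nrt = doc_to_nrt
        self.prices = prices
        self.low = low
        self.high = high
        self.delegate = delegate   # next stage, for example the ranking collector

    def collect(self, doc_id):
        price = self.prices[self.doc_to_nrt[doc_id]]
        if self.low <= price <= self.high:
            self.delegate(doc_id)
        # Otherwise: don't delegate, and the document is dropped from the results.


collected = []
post_filter = NrtPriceFilter(
    doc_to_nrt=[3, 0, 1, 2],
    prices=[100, 200, 250, 150],
    low=100,
    high=150,
    delegate=collected.append,
)
for doc_id in range(4):   # pretend all four documents matched the main query
    post_filter.collect(doc_id)
print(collected)
```

The filter runs once per matched document and reads the live price each time, so it always sees the latest value, but its cost grows with the number of matches.
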
NRT Store - Invert index
[Diagram labels: NRT Forward Store; NRT Inverter; NRT DocIdSet Cache; Offer:O1 DocIdSet]
Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
NRT DocIdSet Cache:
Availability : T → 0, 3
Offer : O1 → 2, 3

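One way to picture the inverter is as a pass over one forward-index column that groups segment doc ids by value. The Python sketch below is an assumption about the idea, not the authors' implementation: `invert` is a hypothetical function, and the toy data reuses the DocId - NrtId mapping and availability column from the forward-index slides, so the doc ids it produces are not meant to reproduce the figures on this slide.

```python
def invert(doc_to_nrt, column):
    """Build value -> ascending segment doc ids from one forward-index column."""
    doc_id_sets = {}
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        doc_id_sets.setdefault(column[nrt_id], []).append(doc_id)
    return doc_id_sets


doc_to_nrt = [3, 0, 1, 2]              # position = doc id, value = NRT id
availability = ["T", "F", "F", "T"]    # position = NRT id

# Cache of doc id sets, rebuilt from the forward store rather than per query.
doc_id_set_cache = {"availability": invert(doc_to_nrt, availability)}
print(doc_id_set_cache["availability"]["T"])
```

A filter can then reuse the cached doc id set instead of checking every matched document. The cached set reflects the forward store only as of the last rebuild.
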
Solr Integration Points
•  ValueSources
•  Filtering
–  Custom Filter Implementation for cached DocIdSet
–  Custom PostFilter
•  Query
–  Wrapper over Filter
•  Custom FacetComponent

Near Real Time Solr Architecture
[Diagram labels: Catalogue, Pricing, Availability, Offers, Seller Quality, Others; Ingestion pipeline; Kafka; NRT Updates; Lucene Updates; Solr Master; Commit + Replicate + Reopen; Solr with Lucene, NRT Forward Index, NRT Inverted store, Ranking, Matching, Faceting; Redis Bootstrap]

Accomplishments
•  Real time sorting
•  Real time filtering: PostFilter
–  Higher latency
•  Near real time filtering: cached DocIdSet
–  No consistency between lookup and filtering
•  Independent of Lucene commits
•  Query latency comparable to DocValues
–  Consistent 99% performance

Accomplishments @ Flipkart
●  Real time consumption for ~150 Signals
●  Reduction in shown out of stock products by 2X
●  Production instances of ~50K updates/second real time

Thank you & Questions

Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart

  • 1. OCTOBER  11-­‐14,  2016    •    BOSTON,  MA  
  • 2. Near  Real  8me  Indexing   Building  Real  Time  Search  Index  For  E-­‐Commerce     Umesh  Prasad   Tech  Lead    @  Flipkart     Thejus  V  M   Data  Architect  @  Flipkart      
  • 3. Agenda   •  Search  @  Flipkart   •  Need  for  Real  Time  Search   •  SolrCloud  Solu;on   •  Our  approach   •  Q  &  A  
  • 4.
  • 5.
  • 6. Traffic  @  Flipkart   •  Peak  Traffic     –  ~  800K  ac;ve  users   –  ~  160K    requests  per  second     •  Search  Traffic     –  ~  40K  searches  per  second  (Service)   –  ~  10K  searches  per  second  (Solr  )   •  Latency   –   Median  :  11  ms   –   99th  percen;le  :  1.1  second  
  • 7. Search  @  Flipkart   •  Catalogue     –  ~  50  main  categories   – ~  5000  sub-­‐categories   – ~  231  million  documents   – ~  90  million  SKUs   – ~  160  million  lis;ngs     •  E-­‐commerce  Marketplace     – ~  100K    Sellers   – Local  Sellers   – Regional  Availability   – Logis;cs  Constraints    
  • 8. E-­‐commerce  Search   •  Heavy  usage  of  drill  down  filters   •  Heavy  usage  of  face;ng   •  Only  top  results  maer   •  Results  grouped/collapsed  by  products     •  Serviceability  and  delivery  experience  MATTERS    
  • 9. Agenda   •  Search  @  Flipkart   •  Need  for  Real  Time  Search   •  SolrCloud  Solu;on   •  Our  approach   •  Q  &  A  
  • 10. Sorry,      Stock  Over      !!?  
  • 11. Damn  !!  Is  Offer  Over  ??  
  • 12. What  !!    All  Steal  Deals  Gone  ??  
  • 13. Product  /Lis;ng:  Important  Aributes   Seller   Ra;ng   Service   catalogue   service   Promise   Service   Availability Service Offer   Service   Pricing   Service   Product  aka  SKU   Lis;ngs  
  • 14. Summary  :    Lucene  Document   •  Product/SKU    (Parent  Document)   –  Lis;ng  (Child  Document)     •  Query  :    Mostly    SKU  Aributes            (Free  Text)   •  Filters  :  SKU  +    Lis;ng  Aributes        (Drill  Down)   •  Ranking  :  SKU  +  Lis;ng    Aributes        (Explicit/ Relevance)     •  Index  Time  Join  aka  Block  Join        (Best   Performance)      
  • 15. Out Of Stock, but Why Show?
      • Index has stale availability data
      • 234K products
  • 16. Challenge 1: High Update Rates
      Signal             updates/sec (normal)   updates/sec (peak)   updates/hr
      text / catalogue   ~10                    ~100                 ~100K
      pricing            ~100                   ~1K                  ~10 million
      availability       ~100                   ~10K                 ~10 million
      offer              ~100                   ~10K                 ~10 million
      seller rating      ~10                    ~1K                  ~1 million
      signal 6           ~10                    ~100                 ~1 million
      signal 7           ~100                   ~10K                 ~10 million
      signal 8           ~100                   ~10K                 ~10 million
  • 17. Challenge 2: Micro Services
      • Diagram labels: Catalogue, Pricing, Availability, Offers, ...; Change Propagation; Updates Stream 1, Updates Stream 2, Updates Stream 3; Ingestion pipeline; Document Builder; Documents {L1, L2 … P1}; Solr/Lucene
      • Lucene doesn't support partial updates
      • Update = Delete + Add
  • 18. Agenda
      • Search @ Flipkart
      • Need for Real Time Search
      • SolrCloud Solution
      • Our approach
      • Q & A
  • 19. SolrCloud for NRT
      • Diagram labels: Ingestion pipeline; Batch of documents; Shard Leader; Update Log (for document versioning); Auto commit; Soft commit; Forward to replica
      • Six Shard Replicas, each with its own "Re-open searcher" step
  • 20. SolrCloud Evaluation
      • Update = Delete + Add
          – Block Join index → update whole block (product + listings)
      • Updated document gets streamed to all replicas in sync
          – Reduces indexing throughput
      • Soft commit is not free
          – Soft commit → in-memory segment
          – Lots of merges
          – Huge document churn / deletes
          – All caches still need to be re-generated
          – Filter cache misses especially hurt performance
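The "Update = Delete + Add" point, combined with block join, can be sketched as follows. This is a toy append-only index in plain Python, not Lucene or Solr code. The class and method names are hypothetical, and the only assumption carried over from the slide is that changing one listing forces the whole product block to be deleted and re-added.

```python
class AppendOnlyIndex:
    """Toy index: documents are only appended; deletes just flip a live flag."""

    def __init__(self):
        self.docs = []   # (sku, document) pairs in insertion order
        self.live = []   # live[i] is False once document i has been deleted

    def _append(self, sku, doc):
        self.docs.append((sku, doc))
        self.live.append(True)

    def add_block(self, sku, listings):
        # Children first, parent last, as in a block-join layout.
        for listing in listings:
            self._append(sku, {"type": "listing", **listing})
        self._append(sku, {"type": "product", "sku": sku})

    def update_listing(self, sku, seller, **changes):
        # Gather the block's current listings before deleting anything.
        current = [
            {k: v for k, v in doc.items() if k != "type"}
            for i, (s, doc) in enumerate(self.docs)
            if s == sku and self.live[i] and doc["type"] == "listing"
        ]
        # Delete: mark every document of the block as dead.
        for i, (s, _) in enumerate(self.docs):
            if s == sku:
                self.live[i] = False
        # Add: re-append the whole block with the single change applied.
        for listing in current:
            if listing["seller"] == seller:
                listing.update(changes)
        self.add_block(sku, current)

    def deleted_count(self):
        return self.live.count(False)


idx = AppendOnlyIndex()
idx.add_block("P1", [{"seller": "s1", "price": 100}, {"seller": "s2", "price": 110}])
idx.update_listing("P1", "s1", price=90)   # one price change rewrites three documents
```

In this sketch a single price change leaves three dead documents behind and appends three new ones, which is the kind of churn the slide attributes to high update rates.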
  • 21. Agenda
      • Search @ Flipkart
      • Need for Real Time Index
      • SolrCloud Solution
      • Our approach
      • Q & A
  • 22. Lucene Index (one Lucene segment)
      • Documents
          – ProductA: brand: Apple, availability: T, price: 45000
          – ProductB: brand: Samsung, availability: T, price: 23000
          – ProductC: brand: Apple, availability: F, price: 5000
      • Document ID mappings: 0 → ProductA, 1 → ProductB, 2 → ProductC
      • Posting list (inverted index): terms → sparse bitsets
          – brand: Apple → 0, 2
          – brand: Samsung → 1
          – availability: T → 0, 1
      • DocValues (columnar data)
          – Price: 45000, 23000, 5000
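The three structures on this slide can be rebuilt from its example products in a few lines. The sketch below is plain Python, not the Lucene API. It uses the slide's data, while the `search` helper and variable names are assumptions made for illustration.

```python
from collections import defaultdict

# The three example products from the slide, in document-ID order.
products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]

doc_ids = {}                    # document ID -> product ID
postings = defaultdict(list)    # "field:value" term -> ascending document IDs
price_doc_values = []           # columnar prices, indexed by document ID

for doc_id, (product_id, fields) in enumerate(products):
    doc_ids[doc_id] = product_id
    for name in ("brand", "availability"):
        postings[f"{name}:{fields[name]}"].append(doc_id)
    price_doc_values.append(fields["price"])


def search(*terms):
    """Intersect the posting lists of all terms (a simple AND query)."""
    result = None
    for term in terms:
        ids = set(postings.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Matching uses the inverted index; sorting reads the columnar values.
available_apple = search("brand:Apple", "availability:T")
available_by_price = sorted(search("availability:T"),
                            key=lambda d: price_doc_values[d])
```

Matching touches only the posting lists, while sorting touches only the price column, which is the split between inverted and forward data that the later slides build on.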
  • 23. A Typical Search Flow
      • Flow: Query → Query Rewrite → Matching → Ranking → Faceting → Stats → Results
      • Example query: samsung mobiles; Offer: exchange offer; price desc
      • After rewrite: category: mobiles; brand: samsung; Offer: exchange offer
      • Data sources: Lucene segment (Posting List = Inverted Index, Doc Values = Forward Index), NRT Store, Other Components
  • 24. NRT Forward Index - Considerations
      • Lookup efficiency
          – 50th percentile: ~10K matches
          – 99th percentile: ~1 million matches
      • Data on Java heap
          – Memory efficiency
  • 25. NRT Forward Index - Naive Implementation
      • Lucene segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
      • NRT forward index, keyed by ProductId
          ProductId   Availability   Price
          ProductA    True           100
          ProductB    False          150
          ProductC    False          200
          ProductD    True           250
      • Lookup (DocId: 3, field: price): ProductId(3) = ProductD → <ProductD, price> → 250
      • Latency: ~10 secs for ~1 million lookups
  • 26. NRT Store - Forward Index Optimized
      • Lucene segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
      • DocId → NrtId mapping: 0 → 3, 1 → 0, 2 → 1, 3 → 2
      • NRT forward index (segment independent), indexed by NrtId
          NrtId   ProductId   Price   Availability   Status
          0       ProductA    100     T              01
          1       ProductC    200     F              10
          2       ProductD    250     F              01
          3       ProductB    150     T              00
      • Lookup (DocId: 3, field: price): NrtId(3) = 2 → Price(2) = 250
      • Latency: ~100 ms for ~1 million lookups
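The difference between the naive and the optimized lookup can be sketched with the slide's own example values. This is illustrative Python, not the authors' Java implementation. The function names are hypothetical, and the deck does not say where the speedup comes from, so the sketch only shows the shape of the two lookup paths.

```python
from array import array

# Lucene segment: document ID -> product ID (values from the slides).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# Naive store: one hash lookup keyed by (product ID, field) per matched document.
naive_store = {
    ("ProductA", "price"): 100,
    ("ProductB", "price"): 150,
    ("ProductC", "price"): 200,
    ("ProductD", "price"): 250,
}


def naive_lookup(doc_id, field):
    product_id = segment_product_ids[doc_id]   # resolve the string ID first
    return naive_store[(product_id, field)]    # then hash the composite key


# Optimized store: a dense integer NrtId per product, independent of any segment.
nrt_ids = {"ProductA": 0, "ProductC": 1, "ProductD": 2, "ProductB": 3}
price_by_nrt_id = array("i", [100, 200, 250, 150])

# Per-segment DocId -> NrtId array, built once when the segment is opened.
doc_to_nrt = array("i", [nrt_ids[p] for p in segment_product_ids])


def fast_lookup(doc_id):
    # Two array reads, no string handling on the query path.
    return price_by_nrt_id[doc_to_nrt[doc_id]]


def apply_price_update(product_id, new_price):
    # An update overwrites one slot in place; the Lucene segment is untouched.
    price_by_nrt_id[nrt_ids[product_id]] = new_price
```

Since only the DocId → NrtId array depends on the segment, a price update in this sketch is visible to the next query without any commit or searcher re-open.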
  • 27. NRT Store Filter - PostFilter
      • Filter: PostFilter(Price:[100 TO 150])
      • Lucene segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
      • DocId → NrtId mapping: 0 → 3, 1 → 0, 2 → 1, 3 → 2
      • NRT forward index (segment independent), indexed by NrtId
          NrtId   ProductId   Price   Availability   Status
          0       ProductA    100     T              01
          1       ProductC    200     F              10
          2       ProductD    250     F              01
          3       ProductB    150     T              00
      • Example (DocId: 3): NrtId(3) = 2 → Price(2) = 250 → don't delegate
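The "don't delegate" decision can be sketched as a collector that wraps another collector and forwards only the documents whose current price is in range. It is plain Python modeled loosely on the delegating-collector idea behind Solr post filters, and it does not use the Solr API. Class names are hypothetical and the data is taken from the slide.

```python
# Values from the slide: DocId -> NrtId mapping and prices indexed by NrtId.
doc_to_nrt = [3, 0, 1, 2]
price_by_nrt_id = [100, 200, 250, 150]


class ListCollector:
    """Final collector: simply records every document ID it receives."""

    def __init__(self):
        self.doc_ids = []

    def collect(self, doc_id):
        self.doc_ids.append(doc_id)


class PriceRangePostFilter:
    """Runs after the main query; forwards a document only if its price is in range."""

    def __init__(self, low, high, delegate):
        self.low, self.high, self.delegate = low, high, delegate

    def collect(self, doc_id):
        price = price_by_nrt_id[doc_to_nrt[doc_id]]   # real-time forward lookup
        if self.low <= price <= self.high:
            self.delegate.collect(doc_id)             # delegate
        # Otherwise: don't delegate, so the document is dropped.


results = ListCollector()
post_filter = PriceRangePostFilter(100, 150, results)
for matched_doc in range(4):   # pretend the main query matched every document
    post_filter.collect(matched_doc)
```

The check runs once per matched document, which fits the deck's later remark that real-time filtering through a PostFilter comes with higher latency.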
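  • 28. NRT Store - Invert Index
      • Diagram labels: NRT Forward Store; NRT Inverter; NRT DocIdSet Cache; NRT Filter
      • Lucene segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
      • NRT DocIdSet cache
          – Availability: T → 0, 3
          – Offer: O1 → 2, 3
      • Example filter: Offer: O1 → DocIdSet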
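One way to read this slide is that an inverter scans the forward store for a segment and caches the resulting document-ID sets for use as filters. The sketch below is plain Python under that assumption. The forward-store values are chosen so that the output reproduces the cache contents shown on this slide, and the class and function names are hypothetical.

```python
# DocId -> NrtId mapping from the earlier slides.
doc_to_nrt = [3, 0, 1, 2]

# Forward store indexed by NrtId (ProductA, ProductC, ProductD, ProductB).
# Hypothetical values, picked to match this slide's DocIdSet cache.
forward_store = {
    "availability": ["F", "F", "T", "T"],
    "offer": [None, "O1", "O1", None],
}


def invert(field, value):
    """Scan the forward store and return the document IDs where field == value."""
    column = forward_store[field]
    return frozenset(doc_id for doc_id, nrt_id in enumerate(doc_to_nrt)
                     if column[nrt_id] == value)


class DocIdSetCache:
    """Caches inverted sets; entries stay stale until refresh() is called."""

    def __init__(self):
        self._sets = {}

    def get(self, field, value):
        key = (field, value)
        if key not in self._sets:
            self._sets[key] = invert(field, value)
        return self._sets[key]

    def refresh(self):
        # Re-invert every cached entry against the current forward store.
        self._sets = {key: invert(*key) for key in self._sets}


cache = DocIdSetCache()
available = cache.get("availability", "T")
on_offer = cache.get("offer", "O1")
available_on_offer = available & on_offer
```

Until `refresh()` runs, a cached set can disagree with the forward store, which is one possible illustration of the "no consistency between lookup and filtering" trade-off listed under Accomplishments.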
  • 29. Solr Integration Points
      • ValueSources
      • Filtering
          – Custom Filter implementation for cached DocIdSet
          – Custom PostFilter
      • Query
          – Wrapper over Filter
      • Custom FacetComponent
  • 30. Near Real Time Solr Architecture
      • Sources: Catalogue, Pricing, Availability, Offers, Seller Quality, Others
      • Diagram labels: Ingestion pipeline; Kafka; NRT Updates; Lucene Updates; Solr Master; Commit + Replicate + Reopen; Redis; Bootstrap
      • Inside Solr: Lucene, NRT Forward Index, NRT Inverted store; Matching, Ranking, Faceting
  • 31. Accomplishments
      • Real time sorting
      • Real time filtering: PostFilter
          – Higher latency
      • Near real time filtering: cached DocIdSet
          – No consistency between lookup and filtering
      • Independent of Lucene commits
      • Query latency comparable to DocValues
          – Consistent 99% performance
  • 32. Accomplishments @ Flipkart
      • Real time consumption for ~150 signals
      • 2X reduction in out-of-stock products shown
      • Production instances with ~50K real time updates/second
  • 33. Thank you & Questions