Consuming Real-time Signals in Solr
Umesh Prasad
SDE 3 @ Flipkart
Flipkart’s Index
1. A couple of million documents, organized across multiple indexes / Solr cores.
2. SKUs are the documents.
3. Extensive use of facets and filters.
4. Not all searches allow faceting.
Lots of custom components
1. Custom collectors (to enable blending of results for diversity / personalization)
2. Custom query parsers (to enable really customized scoring)
3. Custom fields
Typical Ecommerce Document
● Catalogue data
○ Static
○ Largely textual
● Pricing related data
○ Dynamic
○ Faster moving
● Offers
○ Channel-specific, based on the nature of the event
● Availability
○ Dynamic
○ Faster moving
and more...
First Cut Integration
1. Catalogue Management System aka CMS
a. Single source of truth for all systems.
b. Merges data from multiple sources, doing joins, and keeps the latest snapshot, keyed by product id.
c. Raises a notification whenever the data changes.
[Diagram] Catalogue Management System (static and dynamic data) —notification→ Data Import Handler (fetch, transform, dedup, update) → SOLR. Additional inputs: sales signals, custom tags.
But …
1. CMS limitations
a. Too much data (more than 80% of it is of no interest to the search system).
b. CMS has to keep data forever (remember, it is the source of truth), but the search system doesn’t need to index all documents (e.g. obsolete products), so there are lots of drops.
c. Merging becomes too much for CMS and introduces lag.
2. DIH limitations
a. Single-threaded (the multithreaded mode had bugs and was removed in 4.x, SOLR-3262).
b. Too many notifications from CMS (fetch, transform, compare, discard still has a cost), and being single-threaded doesn’t help.
c. Some signals are of interest to the search system only (normalized revenue, tag pages), but they are difficult to integrate proactively.
So CMS is re-factored
[Diagram] CMS (service) and dynamic field services each emit notification streams. Dynamic sorting fields (sparse, but a lot of them) are snapshotted into a MySQL DB. The SOLR master consumes the snapshot as an external field through DIH; SOLR slaves replicate from the master.
Why are partial updates a challenge in Lucene?
1. Update
a. Lucene doesn’t support partial updates. That is tough to do with an inverted index, because all terms for that document would need to be updated. There are lots of open tickets.
b. LUCENE-4272 (term-vector based), LUCENE-3837, LUCENE-4258 (overlay-segment based): incremental field updates through stacked segments.
c. Document @ t1 → terms {T1, T2, T3, T4, T5}
d. Document @ t2 → terms {T1, T4, T10}
e. An inverted index actually stores a posting list per term. These posting lists are quite sparse and are compressed using delta encodings for efficiency.
f. T1 → {1, 5, 7} etc.
g. T2 → {2, 5, 6}
h. To support a partial update, the document would have to be removed from the posting lists of all its previous terms. That is non-trivial, because it requires remembering and storing all the terms of a given document.
i. So instead, Lucene and other inverted-index systems mark the old document as deleted in a separate data structure (live docs).
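The scheme above can be illustrated with a toy inverted index (a conceptual Python sketch of the idea, not actual Lucene code): updating one field in place would require walking every term’s posting list, so the old document is instead tombstoned in a live-docs map and a fresh copy of the whole document is added.

```python
# Toy inverted index: term -> sorted list of docIds.
index = {"T1": [1, 5, 7], "T2": [2, 5, 6], "T3": [5], "T4": [5], "T5": [5]}
live_docs = {1: True, 2: True, 5: True, 6: True, 7: True}
next_doc_id = 8

def update_document(old_doc_id, new_terms):
    """A "partial" update = delete + add, regardless of which field changed.

    Removing old_doc_id from the posting lists of its previous terms would
    require knowing all of them; marking it dead in live_docs is O(1).
    """
    global next_doc_id
    live_docs[old_doc_id] = False          # tombstone the old document
    new_id = next_doc_id
    next_doc_id += 1
    live_docs[new_id] = True
    for term in new_terms:                 # re-index the *whole* document
        index.setdefault(term, []).append(new_id)
    return new_id

# Document 5 changes from {T1..T5} at t1 to {T1, T4, T10} at t2:
new_id = update_document(5, ["T1", "T4", "T10"])
# Doc 5 still sits in T2's postings, but is filtered out at search time:
matches = [d for d in index["T2"] if live_docs.get(d)]
```

The stale posting entries are only physically reclaimed later, when segments are merged, which is exactly the background compaction the next slide describes.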
Why are partial updates a challenge in Lucene?
1. What this means: an update is actually a delete + add (regardless of which attribute changed).
a. Deleted documents are later compacted away by a background merge thread.
2. Updates become visible only after a commit.
a. A soft commit creates a new segment in memory.
b. A hard commit does an fsync to the directory.
But do we need to re-index the document? Let’s evaluate
1. Lucene might hold 3 kinds of data:
a. Data used for actual search (analyzed, converted into tokens)
b. Data used for plain filtering (not analyzed, e.g. price, discount)
c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them)
2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
a. The pipeline can be spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags, usually with enumerated values.
a. Can be dynamic
b. Can be governed by policies and business constraints.
But do we need to re-index the document? Let’s evaluate
1. Ranking signals ⇒ need row-oriented (per-document) access.
a. Can be batch updates (e.g. category-specific ranks, ratings) or real-time updates (e.g. availability).
b. Lucene actually un-inverts such fields using the FieldCache.
c. Doc values were introduced to manage the cost of the FieldCache and to provide better updatability.
d. Updatable NumericDocValues (LUCENE-5189, since 4.6); updatable BinaryDocValues (LUCENE-5513, since 4.8).
e. Solr still doesn’t have updatable doc values: a Jira ticket is open, but there are issues around the update/write-ahead logs (SOLR-5944).
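The appeal of doc values for ranking signals can be sketched as follows (a hypothetical Python model of a per-document column, not Lucene’s actual on-disk format): a doc-values field gives O(1) docId → value access, so changing one document’s signal is a single slot write rather than a full delete + re-add.

```python
import array

# Hypothetical doc-values column for one ranking signal: one slot per docId.
# This is the access pattern ranking needs (docId -> value), without
# un-inverting the field at search time the way the FieldCache does.
num_docs = 10
popularity = array.array("d", [0.0] * num_docs)

def update_doc_value(doc_id, value):
    # An in-place numeric update: no delete + add, no re-analysis.
    popularity[doc_id] = value

update_doc_value(3, 42.5)          # e.g. a rating or availability changed
score_component = popularity[3]    # read back inside the scoring loop
```

This is the model behind updatable NumericDocValues; the caveats on the next slides are about when such an update actually becomes visible and what it costs at commit time.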
First Approach: Leverage Updatable Numeric DocValues
1. The Solr limitation is easily overcome in the master–slave model by plugging in your own update chain and accessing the IndexWriter directly.
2. But:
a. You need a commit for doc values to reflect. (Not real time!)
b. Filtering on doc values is inefficient, especially on numeric fields.
c. Making it work in SolrCloud is non-trivial. For details, see SOLR-5944.
d. Doc values are dense and updates are not stacked: every commit dumps the full view of the modified field’s doc values (optimizing for search performance) (http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html).
e. But what if we had doc values for 500 fields over millions of docs?
First Approach: Leverage Updatable Numeric DocValues
1. Commit caveats:
a. A soft commit is NOT free. A soft commit in Solr = IndexWriter.getReader() in Lucene = flush + open. There is NRTCachingDirectory, which caches the small segments produced and makes soft commits cheaper; details can be found in McCandless’s post.
b. Solr invalidates all caches on every commit, and they have to be regenerated. Some caches, like the filterCache, have a huge impact on performance; warming them up can itself take 2–3 minutes at times.
c. Warm-up puts memory pressure on the JVM and causes allocation spikes. Some caches, like the documentCache, can’t even be warmed up.
d. More commits ⇒ more segments ⇒ more merges.
2nd Approach: NRT Store and Value Sources
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
- abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext)
  Gets the values for this reader and the context that was previously passed to createWeight().
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.html
FunctionValues
- boolean exists(int doc): returns true if there is a value for this document
- double doubleVal(int doc)
Value sources allowed us to plug external data sources right into Solr. The external data need not be part of the index itself, but it must be cheaply retrievable, because it will be called millions of times, right inside a loop.
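The FunctionValues contract above can be modeled like this (a Python sketch of the idea, with a plain dict standing in for the real NRT store; the store contents and field name are hypothetical): the scorer asks exists(doc) / doubleVal(doc) once per matching document inside the hot loop, so the backing lookup has to be near doc-values speed.

```python
# Sketch of the per-reader FunctionValues contract, with an in-memory
# dict standing in for the external NRT store (not actual Lucene code).
class ExternalFunctionValues:
    def __init__(self, store):
        self._store = store            # docId -> float, e.g. a live signal

    def exists(self, doc):
        """True if there is a value for this document."""
        return doc in self._store

    def double_val(self, doc):
        # Called once per matching doc, millions of times per query,
        # so this lookup must be as cheap as a doc-values read.
        return self._store.get(doc, 0.0)

# Scoring-loop sketch: blend a static base score with the live signal.
nrt_store = {0: 2.0, 2: 0.5}           # e.g. real-time availability boost
values = ExternalFunctionValues(nrt_store)
docs = [0, 1, 2]
scores = [1.0 + values.double_val(d) for d in docs]
```

Because the value is fetched at query time rather than indexed, nothing needs a commit to become visible, which is the point of this approach.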
The Challenge
1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
2. Solution: get rid of the query cache altogether. But we still have the filterCache.
3. So now matching and scoring had to be really fast.
a. Calls to the value source need to be extremely fast. We optimized them to be as fast as accessing doc values.
b. The ranking functions themselves have a cost; some of the optimizations involved reducing the cost of the math functions themselves.
So the learnings
1. Understand your data, its rate of change, and what you want to do with it.
2. Solr/Lucene have really good abstractions around both indexing and querying, and both provide a lot of hooks and plugins. Think through and take advantage of them.
3. Experiment, profile and benchmark. Delve into the APIs and internals.
4. The experts do help: the insights about dense doc values and soft commits not being free came directly from discussions with Shalin.
5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a Lucene codec which built and updated an inverted index in Redis.

More Related Content

What's hot

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture Ramez Al-Fayez
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchSigmoid
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민종민 김
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Spark Summit
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solrKnoldus Inc.
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...Jisang Yoon
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리종민 김
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesOpenSource Connections
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 

What's hot (20)

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solr
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...
PPT - Enhancing the Locality and Breaking the Memory Bottleneck of Transforme...
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 

Viewers also liked

The parsers & test upload
The parsers & test uploadThe parsers & test upload
The parsers & test uploadAnupam Jain
 
EmergingTrendsInComputingAndProgrammingLanguages
EmergingTrendsInComputingAndProgrammingLanguagesEmergingTrendsInComputingAndProgrammingLanguages
EmergingTrendsInComputingAndProgrammingLanguagesDeepak Shevani
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Evolution of Programming Languages
Evolution of Programming LanguagesEvolution of Programming Languages
Evolution of Programming LanguagesSayanee Basu
 
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks EnterpriseImplementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks EnterpriseLucidworks (Archived)
 
Evolution of Programming Languages Over the Years
Evolution of Programming Languages Over the YearsEvolution of Programming Languages Over the Years
Evolution of Programming Languages Over the Yearsdesigns.codes
 

Viewers also liked (8)

Search@flipkart
Search@flipkartSearch@flipkart
Search@flipkart
 
The parsers & test upload
The parsers & test uploadThe parsers & test upload
The parsers & test upload
 
EmergingTrendsInComputingAndProgrammingLanguages
EmergingTrendsInComputingAndProgrammingLanguagesEmergingTrendsInComputingAndProgrammingLanguages
EmergingTrendsInComputingAndProgrammingLanguages
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Evolution of Programming Languages
Evolution of Programming LanguagesEvolution of Programming Languages
Evolution of Programming Languages
 
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks EnterpriseImplementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise
 
Evolution of Programming Languages Over the Years
Evolution of Programming Languages Over the YearsEvolution of Programming Languages Over the Years
Evolution of Programming Languages Over the Years
 

Similar to Consuming RealTime Signals in Solr

Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collectionsAmit Sharma
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaSpringPeople
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorizationAndreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization Warply
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architectureAjeet Singh
 
KP Partners: DataStax and Analytics Implementation Methodology
KP Partners: DataStax and Analytics Implementation MethodologyKP Partners: DataStax and Analytics Implementation Methodology
KP Partners: DataStax and Analytics Implementation MethodologyDataStax Academy
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qsCapgemini
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf pointsocporacledba
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf pointsdba3003
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesCidar Mendizabal
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 

Similar to Consuming RealTime Signals in Solr (20)

Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
Remus_3_0
Remus_3_0Remus_3_0
Remus_3_0
 
127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections127556030 bisp-informatica-question-collections
127556030 bisp-informatica-question-collections
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
 
BigDataDebugging
BigDataDebuggingBigDataDebugging
BigDataDebugging
 
KP Partners: DataStax and Analytics Implementation Methodology
KP Partners: DataStax and Analytics Implementation MethodologyKP Partners: DataStax and Analytics Implementation Methodology
KP Partners: DataStax and Analytics Implementation Methodology
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qs
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiences
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
Bt0066
Bt0066Bt0066
Bt0066
 
B T0066
B T0066B T0066
B T0066
 
LDV.pptx
LDV.pptxLDV.pptx
LDV.pptx
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 

Recently uploaded

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Data Cloud, More than a CDP by Matt Robison
Consuming RealTime Signals in Solr

  • 1. Consuming Real-time Signals in Solr
    Umesh Prasad, SDE 3 @ Flipkart
  • 2. Flipkart’s Index
    1. Data organized in multiple indexes/Solr cores; a couple of million documents.
    2. SKUs are the documents.
    3. Extensive use of facets and filters.
    4. Not all searches allow faceting.
    Lots of custom components:
    1. Custom collectors (for blending results for diversity/personalization)
    2. Custom query parsers (for really customized scoring)
    3. Custom fields
  • 3. Typical E-commerce Document
    ● Catalogue data
      ○ Static
      ○ Largely textual
    ● Pricing-related data
      ○ Dynamic
      ○ Faster moving
    ● Offers
      ○ Channel-specific, based on the nature of the event
    ● Availability
      ○ Dynamic
      ○ Faster moving
    ...and more
  • 4. First-Cut Integration
    1. Catalogue Management System (CMS)
      a. Single source of truth for all systems.
      b. Merges data from multiple sources, doing joins, and keeps the latest snapshot, keyed by Product Id.
      c. Raises a notification whenever the data changes.
    Flow: CMS (static and dynamic data) plus sales signals and custom tags → notification → Data Import Handler (fetch, transform, dedup, update) → Solr
  • 5. But ...
    1. CMS limitations
      a. Too much data (more than 80% of it of no interest to the search system).
      b. CMS has to keep data forever (remember, it is the source of truth), but the search system doesn’t need to index all documents (e.g. obsolete products), so there are lots of drops.
      c. Merging becomes too much for CMS and introduces lag.
    2. DIH limitations
      a. Single-threaded (the multithreaded mode had bugs and was removed in 4.x, SOLR-3262).
      b. Too many notifications from CMS (fetch, transform, compare, discard still costs), and being single-threaded doesn’t help.
      c. Some signals are of interest to the search system only (normalized revenue, tag pages) but are difficult to integrate proactively.
  • 6. So CMS is refactored
    Flow: the CMS (service) and the dynamic field services each emit a notification stream; dynamic sorting fields (sparse, but a lot of them) are kept as a snapshot in a MySQL DB and consumed by the Solr master as an external field through DIH, then replicated to the Solr slaves.
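One stock Solr mechanism for such sparse, fast-moving sort fields is ExternalFileField, whose values live in a file beside the index rather than in it. A hedged schema.xml sketch (field and type names here are illustrative, not from the deck):

```xml
<!-- schema.xml: a float sort field whose values are loaded from an
     external_<fieldname> file in the index data directory, refreshed
     when a new searcher is opened, with no re-indexing of documents -->
<fieldType name="extFloat" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="dynamic_sort_rank" type="extFloat"
       indexed="false" stored="false"/>
```

Such fields are usable in function queries and sorting but not for searching or faceting, which matches the "dynamic sorting fields" role in the diagram above.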
  • 7. Why are partial updates a challenge in Lucene?
    1. Update
      a. Lucene doesn’t support partial updates; they are tough to do with an inverted index, because all terms for that document need to be updated. Lots of open tickets:
      b. LUCENE-4272 (term-vector based), LUCENE-3837, LUCENE-4258 (overlay-segment based): incremental field updates through stacked segments.
      c. Document @ t1 → terms {T1, T2, T3, T4, T5}
      d. Document @ t2 → terms {T1, T4, T10}
      e. An inverted index stores a posting list per term. These posting lists are quite sparse and are compressed using delta encodings for efficiency.
      f. T1 → {1, 5, 7}
      g. T2 → {2, 5, 6}
      h. To support a partial update, the document would have to be removed from the posting lists of all its previous terms. That is non-trivial, because it would require remembering and storing all terms for a given document.
      i. So instead, Lucene and other inverted-index systems mark the old document as deleted in a separate data structure (live docs).
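The posting-list problem on this slide can be sketched in plain Java. This is a toy model, not Lucene's actual data structures: each term maps to a posting list of doc IDs, nothing maps a doc ID back to its terms, so an "update" is a tombstone plus a fresh add.

```java
import java.util.*;

// Toy inverted index illustrating why in-place partial updates are hard:
// postings are per-term, so removing one doc from all its old terms would
// require knowing every term the doc ever had. Instead (as in Lucene) the
// old doc is masked out via a "live docs" set and a new doc is added.
class ToyIndex {
    final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    final Set<Integer> liveDocs = new HashSet<>();
    int nextDocId = 0;

    int add(Collection<String> terms) {
        int docId = nextDocId++;
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        liveDocs.add(docId);
        return docId;
    }

    // "Partial update" = delete + add: old postings stay physically in
    // place; the old doc is only tombstoned via liveDocs.
    int update(int oldDocId, Collection<String> newTerms) {
        liveDocs.remove(oldDocId);
        return add(newTerms);
    }

    // Search consults liveDocs to skip tombstoned documents.
    Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>(postings.getOrDefault(term, new TreeSet<>()));
        hits.retainAll(liveDocs);
        return hits;
    }
}
```

Updating a document's terms from {T1, T2} to {T1, T10} leaves the stale posting for T2 physically present; only liveDocs hides it, which is why deleted documents must later be compacted by segment merges.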
  • 8. Why are partial updates a challenge in Lucene? (contd.)
    1. What this means is that an update is actually:
      a. Delete + Add, regardless of which attribute changed.
      b. Deleted documents are compacted by a background merge thread.
    2. Updates become visible only after a commit:
      a. A soft commit creates a new segment in memory.
      b. A hard commit does an fsync to the directory.
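The commit-visibility rule can be illustrated with a minimal reader-snapshot model (a sketch of the idea only, not Lucene's NRT machinery): writes accumulate in a pending buffer and readers only ever see the last published snapshot.

```java
import java.util.*;

// Minimal model of commit visibility: addDocument() buffers, and the
// buffered docs become searchable only when commit() publishes a new
// immutable snapshot, much like reopening a near-real-time reader.
class SnapshotStore {
    private final List<String> pending = new ArrayList<>();
    private List<String> published = Collections.emptyList();

    void addDocument(String doc) { pending.add(doc); }

    // Analogous to a soft commit: publish an in-memory snapshot.
    void commit() { published = new ArrayList<>(pending); }

    // Readers only ever see the last published snapshot.
    List<String> search() { return published; }
}
```

This is exactly why "real-time" signals pushed through normal document updates inherit the latency of the commit cadence.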
  • 9. But do we need to re-index a document? Let’s evaluate
    1. Lucene might hold 3 kinds of data:
      a. Data used for actual search (analyzed, converted into tokens)
      b. Data used for plain filtering (not analyzed, e.g. price, discount)
      c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them)
    2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
      a. The pipeline can be: spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
    3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags, usually with enumerated values.
      a. Can be dynamic.
      b. Can be governed by policies and business constraints.
  • 10. But do we need to re-index a document? Let’s evaluate (contd.)
    1. Ranking signals ⇒ need to be row-oriented.
      a. Can be batch updates (e.g. category-specific ranks, ratings) or real-time updates (e.g. availability).
      b. Lucene actually un-inverts such fields using the FieldCache.
      c. DocValues were introduced to manage the cost of the FieldCache and to provide better updatability.
      d. Updatable NumericDocValues (LUCENE-5189, since 4.6); updatable BinaryDocValues (LUCENE-5513, since 4.8).
      e. Solr still doesn’t have updatable doc values. A Jira ticket is open, but there are issues around the update/write-ahead logs (SOLR-5944).
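The row-oriented shape of ranking signals is what makes them updatable without re-indexing: a per-field column can be written in place. A toy Java column store (our sketch, not Lucene's doc-values format) also shows the dense-write cost that slide 11 complains about:

```java
import java.util.*;

// Toy column store for per-document numeric ranking signals.
// Updating a value touches only this column, never the inverted index,
// but a "commit" rewrites the whole column for a modified field, which
// is why dense doc-values updates hurt with many fields and many docs.
class NumericColumn {
    private final long[] values;   // one slot per docID: dense, row-oriented
    private boolean dirty = false;

    NumericColumn(int maxDoc) { values = new long[maxDoc]; }

    void update(int docId, long value) { values[docId] = value; dirty = true; }

    long get(int docId) { return values[docId]; }

    // On commit, the entire column is written out if any value changed.
    long[] flushOnCommit() {
        if (!dirty) return new long[0];
        dirty = false;
        return Arrays.copyOf(values, values.length);  // full rewrite
    }
}
```

A single changed value still flushes the full column, so with hundreds of such fields over millions of documents, every commit pays for all of them.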
  • 11. First approach: leverage updatable NumericDocValues
    1. The Solr limitation is easily overcome in the master-slave model by plugging in your own update chain and accessing the IndexWriter directly.
    2. But:
      a. You need a commit for doc values to be reflected (not real time!).
      b. Filtering on doc values is inefficient, especially on numeric fields.
      c. Making it work in SolrCloud is non-trivial. For details, please see SOLR-5944.
      d. Doc values are dense and updates are not stacked: every commit dumps the full view of the modified field’s doc values, optimizing for search performance (http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html).
      e. But what if we had doc values for 500 fields across millions of docs?
  • 12. First approach: leverage updatable NumericDocValues (contd.)
    1. Commit caveats:
      a. A soft commit is NOT free. A soft commit in Solr = IndexWriter.getReader() in Lucene = flush + open. There is NRTCachingDirectory, which caches the small segment produced and makes soft commits cheaper. Details can be found in McCandless’s post.
      b. Solr invalidates all caches on every commit, and they have to be regenerated. Some caches, like the filterCache, have a huge impact on performance; warming them up can itself take 2-3 minutes at times.
      c. Warm-up puts memory pressure on the JVM and causes allocation spikes. Some caches, like the documentCache, can’t even be warmed up.
      d. More commits ⇒ more segments ⇒ more merges.
  • 13. Second approach: NRT store and ValueSources
    http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
    - abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext): gets the values for this reader and the context that was previously passed to createWeight().
    http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.html
    FunctionValues:
    - boolean exists(int doc): returns true if there is a value for this document.
    - double doubleVal(int doc)
    ValueSources allowed us to plug external data sources right inside Solr. The external data need not be part of the index itself, but it must be easily retrievable, because these methods are called millions of times, right inside a loop.
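The ValueSource pattern boils down to a per-document indirection that the scorer calls in its hot loop. A self-contained Java sketch follows; the exists/doubleVal names mirror Lucene's FunctionValues, while the map-backed store and the blend() function are our illustrative assumptions:

```java
import java.util.*;

// Sketch of the ValueSource/FunctionValues idea: scoring consults an
// external, independently updated store per document, so signals change
// without re-indexing and without waiting for a commit. A plain map
// stands in for the NRT store the deck describes.
interface ExternalValues {
    boolean exists(int doc);
    double doubleVal(int doc);
}

class ExternalSignalSource {
    private final Map<Integer, Double> store = new HashMap<>();

    // Called by the signal pipeline in real time; no commit involved.
    void update(int docId, double value) { store.put(docId, value); }

    // Analogous to ValueSource.getValues(): a live view over the store.
    ExternalValues getValues() {
        return new ExternalValues() {
            public boolean exists(int doc) { return store.containsKey(doc); }
            public double doubleVal(int doc) { return store.getOrDefault(doc, 0.0); }
        };
    }

    // A scorer's hot loop step: blend the text score with the signal.
    static double blend(double textScore, ExternalValues vals, int doc) {
        return vals.exists(doc) ? textScore * vals.doubleVal(doc) : textScore;
    }
}
```

Because blend() runs once per matching document per query, the lookup behind doubleVal() must be as cheap as a doc-values read, which is the optimization point the next slide makes.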
  • 14. The challenge
    1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
    2. Solution: get rid of the query cache altogether. But we still have the filterCache.
    3. So now matching and scoring had to be really fast:
      a. Calls to the value source need to be extremely fast. We optimized them so that they are as fast as accessing doc values.
      b. The cost of the ranking functions themselves: some of the optimizations involved reducing the cost of the Math functions themselves.
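Why the query cache had to go can be shown in a few lines. This sketch (hypothetical class, not Solr code) caches a top-N computed from a signal value; since entries never expire and cannot be invalidated individually, the cached ordering silently goes stale when the signal moves:

```java
import java.util.*;

// Sketch of the staleness problem: a query-result cache with no TTL and
// no per-entry invalidation keeps serving an ordering computed from an
// old signal value until something (e.g. a commit) wipes it wholesale.
class CachedRanker {
    private final Map<Integer, Double> signal = new HashMap<>();
    private final Map<String, List<Integer>> queryCache = new HashMap<>();

    void setSignal(int doc, double v) { signal.put(doc, v); } // no invalidation!

    // Rank matching docs by signal, but only on a cache miss.
    List<Integer> topDocs(String query, List<Integer> matches) {
        return queryCache.computeIfAbsent(query, q -> {
            List<Integer> sorted = new ArrayList<>(matches);
            sorted.sort((a, b) -> Double.compare(
                    signal.getOrDefault(b, 0.0), signal.getOrDefault(a, 0.0)));
            return sorted;
        });
    }
}
```

With the query cache removed, every request re-ranks against the live signal, which is exactly what forces the matching and scoring path to be extremely fast.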
  • 15. So, the learnings
    1. Understand your data, its rate of change, and what you want to do with it.
    2. Solr/Lucene have really good abstractions around both indexing and querying, and both provide a lot of hooks and plugins. Think them through and take advantage of them.
    3. Experiment, profile, and benchmark. Delve into the APIs and internals.
    4. The experts do help: the insights about dense doc values and soft commits not being free came directly from discussions with Shalin.
    5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a Lucene codec which built and updated an inverted index in Redis.