Caching Search Engine Results over Incremental Indices
Y! Research Barcelona: Roi Blanco, Flavio Junqueira, Luca Telloli*, Hugo Zaragoza
Y! Labs Haifa: Edward Bortnikov, Ronny Lempel
* currently at Barcelona Supercomputing Center
Overview
• Caching (Background & Prior Art)
• Cache Invalidation Predictors
• Experimental Setup
• Results
• Conclusions and Future Work
High Level Architecture of Search Engines
[Figure: architecture diagram. Labels: Cache, Query Engine, Runtime system, Parser/Tokenizer, Index, Indexing pipeline, WWW; flows labeled "queries", "results" and "terms".]
Web Search Results Caching 
Caching of Web Search Results is crucial: 
• Query stream is extremely REDUNDANT and BURSTY 
– Zipfian distribution of query popularity (redundant) 
– Extreme trending of topics (bursty) 
• CACHE: {q} → Search_Results(q)
• Caching benefits:
– Shorten the engine’s response time (user waiting) 
– Lower the number/cost of query executions (# data centers) 
• Caveat: data (pages) is constantly changing!
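The mapping CACHE: {q} → Search_Results(q) can be sketched as a small results cache with LRU replacement. This is only an illustration: the class name `ResultsCache` and the `evaluate` callback (standing in for the back-end engine) are assumptions, not part of the talk.

```python
from collections import OrderedDict
from typing import Callable, List


class ResultsCache:
    """Maps a query string to its result list; evicts the least recently used entry."""

    def __init__(self, capacity: int, evaluate: Callable[[str], List[str]]):
        self.capacity = capacity
        self.evaluate = evaluate  # hypothetical back-end call that runs the query
        self.entries: "OrderedDict[str, List[str]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def search(self, q: str) -> List[str]:
        if q in self.entries:
            self.entries.move_to_end(q)  # mark as most recently used
            self.hits += 1
            return self.entries[q]
        self.misses += 1
        results = self.evaluate(q)  # the expensive query execution
        self.entries[q] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used query
        return results
```

Every hit avoids one query execution, which is the saving the bullets above refer to; nothing in this sketch notices that the underlying pages have changed, which is the caveat.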
Caching Search Engine Results – Prior Art 
• Markatos, 2001: applied classical replacement policies (LRU, 
SLRU) to a 1M query log from Excite; demonstrated hit rates of 
~30% 
• Replacement policies tailored for search engines: 
– PDC: Probability Driven Cache (Lempel & Moran, 2003) 
– SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and 
Perego, 2003) 
– AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras 
and Witschel, 2007) 
• Other observations and approaches: 
– Lempel & Moran, 2004: theoretical study via competitive analysis 
– Gan & Suel, 2009: optimizing query evaluation work rather than hit 
rates 
– Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and 
Bridge, 2010: refreshing aged entries during times of low back-end 
load 
Traditional View:
• Dilemma: Freshness versus Computation
• Extreme #1: do not cache at all – evaluate all queries
– 100% fresh results, lots of redundant evaluations
• Extreme #2: never invalidate the cache
– A majority of stale results – results refreshed only due to cache replacement, no redundant work
• Middle ground: invalidate periodically (TTL)
– A time-to-live parameter is applied to each cached entry
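A minimal sketch of the TTL middle ground, assuming a hypothetical `evaluate` callback for the engine and a caller-supplied clock value `now`. Setting `ttl=0` reproduces Extreme #1 (every query is evaluated) and `ttl=float("inf")` reproduces Extreme #2 (entries are never invalidated).

```python
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Serves a cached entry only while it is younger than the time-to-live."""

    def __init__(self, ttl: float, evaluate: Callable[[str], List[str]]):
        self.ttl = ttl
        self.evaluate = evaluate  # hypothetical back-end call
        self.entries: Dict[str, Tuple[float, List[str]]] = {}
        self.evaluations = 0

    def search(self, q: str, now: float) -> List[str]:
        cached = self.entries.get(q)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # still within its time-to-live
        # Missing or expired: re-evaluate and restart the entry's clock.
        self.evaluations += 1
        results = self.evaluate(q)
        self.entries[q] = (now, results)
        return results
```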
Caching in the Presence of Index Changes 
• Increasing importance of freshness in search: 
– News, Blogs, Twitter, Social, Reviews, Local… 
• Moving towards “Real Time Crawling”: 
– Latency measured in seconds instead of hours or days. 
• Caching, by definition, returns OLD results 
– Traditionally, as TTL → 0, caching hit rate → 0
• Can we have our cake and eat it too?
– Can a cache operate on a very fast-changing collection?
Cache Invalidation Predictors
• Main idea:
– Not ALL the documents change ALL the time.
– When a document changes (or is created / deleted), remove from the cache any queries that may have returned it.
– E.g., a new document on Spanish cooking arrives; no need to invalidate queries about quantum physics!
• Cache Invalidation Predictors (CIPs):
1. Capture document insertions/updates as they enter the index
2. Using document features, predict which cached search results will be affected by the updates
3. Invalidate those cached entries (equivalent to eviction)
4. Upon document deletion, invalidate any cached entry containing that document
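The four steps can be sketched as a cache with two hooks, one for insertions/updates and one for deletions. The class name, the hook names and the pluggable `predictor` function are hypothetical; the predictor stands in for whatever policy decides that a cached query is affected by a document.

```python
from typing import Callable, Dict, List, Set


class InvalidatingCache:
    """Result cache whose entries are evicted when a predictor flags them."""

    def __init__(self, predictor: Callable[[str, Set[str]], bool]):
        self.predictor = predictor  # (query, document features) -> affected?
        self.entries: Dict[str, List[str]] = {}  # query -> cached document ids

    def put(self, query: str, doc_ids: List[str]) -> None:
        self.entries[query] = doc_ids

    def on_upsert(self, doc_id: str, features: Set[str]) -> List[str]:
        # Steps 1-3: a new or updated document arrives; predict and invalidate.
        affected = [q for q in self.entries if self.predictor(q, features)]
        for q in affected:
            del self.entries[q]
        return affected

    def on_delete(self, doc_id: str) -> List[str]:
        # Step 4: drop every cached entry that contains the deleted document.
        affected = [q for q, docs in self.entries.items() if doc_id in docs]
        for q in affected:
            del self.entries[q]
        return affected
```

With a naive predictor such as `lambda q, f: any(t in f for t in q.split())`, a document with features `{"spanish", "cooking", "paella"}` would evict a cached "paella recipe" entry and leave "quantum physics" untouched.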
Cache Invalidation Predictor Architecture
[Figure: CIP architecture diagram. Labels: Runtime system, Cache, Query Engine, CIP, Synopsis generator, Indexing pipeline, Parser/Tokenizer, Index; flows labeled "queries" and "terms". Legend: data flow, API calls.]
The Invalidator: Brief Implementation Notes 
• The CIP needs to quickly locate, 
given a synopsis (e.g. document), 
which cached entries it matches 
• Essentially a reversed search engine: 
– The synopses are the queries 
– The queries (whose results are cached) are the documents 
• (!) Non-negligible cost of communication, indexing and 
querying. 
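One way to picture the "reversed search engine" is an inverted index built over the cached queries and probed with the terms of an incoming synopsis. The sketch below assumes whitespace tokenization and conjunctive matching (all query terms must appear in the synopsis); `QueryIndex` and its methods are illustrative names only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class QueryIndex:
    """Inverted index from terms to the cached queries that contain them."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.lengths: Dict[str, int] = {}  # number of distinct terms per query

    def add(self, query: str) -> None:
        terms = set(query.lower().split())
        self.lengths[query] = len(terms)
        for t in terms:
            self.postings[t].add(query)

    def match(self, synopsis: Iterable[str]) -> Set[str]:
        # Count, per cached query, how many of its terms occur in the synopsis.
        counts: Dict[str, int] = defaultdict(int)
        for t in set(synopsis):
            for q in self.postings.get(t, ()):
                counts[q] += 1
        # A query matches when every one of its terms was found.
        return {q for q, c in counts.items() if c == self.lengths[q]}
```

Only the postings of terms present in the synopsis are touched, so the lookup cost depends on the synopsis size rather than on the number of cached queries.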
Some Definitions 
• At any given time, a query in the cache may be: 
– Stale: cache entry no longer represents the results the engine 
would return for q 
– or not stale. 
• Stale Rate: proportion of queries for which the search 
engine returns stale results 
• (Both computable, given enough computing time!)
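Both definitions can be computed offline by re-running every served query against the up-to-date index. The sketch below assumes `served` holds the (query, results) pairs actually returned to users and `fresh` holds what the engine would return now; both names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def stale_rate(
    served: Sequence[Tuple[str, List[str]]],
    fresh: Dict[str, List[str]],
) -> float:
    """Fraction of served queries whose returned results differ from the fresh ones."""
    if not served:
        return 0.0
    stale = sum(1 for q, results in served if results != fresh[q])
    return stale / len(served)
```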
Some Definitions (2)
• At any given time a CIP may invalidate a query or not:
– True Positive: invalidation of a stale query
– True Negative: non-invalidation of a non-stale query
– False Positive: invalidation of a non-stale query ($)
– False Negative: non-invalidation of a stale query
• False Negatives are much more expensive than False Positives:
– User dissatisfaction vs. computational time
– Frequency of query!
– Error spread forward in time!
• True Negatives lead to huge savings (× query volume)
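The four outcomes can be tallied from (invalidated, stale) pairs, one per cached query and epoch. Normalizing each count by the total number of decisions is an assumption of this sketch, as are the function names.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def classify(invalidated: bool, stale: bool) -> str:
    """Label a single CIP decision."""
    if invalidated:
        return "TP" if stale else "FP"
    return "FN" if stale else "TN"


def outcome_rates(decisions: Sequence[Tuple[bool, bool]]) -> Dict[str, float]:
    """Share of each outcome among all (invalidated, stale) decisions."""
    labels = ("TP", "TN", "FP", "FN")
    if not decisions:
        return {k: 0.0 for k in labels}
    counts = Counter(classify(i, s) for i, s in decisions)
    return {k: counts[k] / len(decisions) for k in labels}
```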
Invalidation Policies – Upon Match
• Upon match: invalidate query q whenever the synopsis of document d matches q
– E.g., for conjunctive queries, q ⊆ d
– E.g., for a BOW engine, stale rate = 0
• Very low stale rate
• High FP! $$
[Figure: the incoming document "The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals" shown against cached top-10 result lists (URL, score) for the queries "Boston Celtics", "Barack Obama", "World Cup", "L.A. Lakers", "Oil Spill" and "home".]
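The conjunctive upon-match rule (q ⊆ d) can be tried on the slide's example document and queries. The tokenizer below (lowercase, alphanumeric runs) is an assumption made only for this sketch.

```python
import re
from typing import List, Set


def terms(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def invalidate_upon_match(cached_queries: List[str], document: str) -> List[str]:
    """Return the cached queries whose terms are all contained in the document."""
    d = terms(document)
    return [q for q in cached_queries if terms(q) <= d]


doc = ("The Boston Celtics beat the L.A. Lakers on their home court "
       "in the 4th game of the 2010 NBA Finals")
queries = ["Boston Celtics", "Barack Obama", "World Cup",
           "L.A. Lakers", "Oil Spill", "home"]
print(invalidate_upon_match(queries, doc))
# ['Boston Celtics', 'L.A. Lakers', 'home']
```

The query "home" is invalidated as well, although this document presumably would not enter its top results; matches of this kind are where the high false-positive cost comes from.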
Invalidation Policies – Score Thresholding
• Score Thresholding: invalidate q whenever the projected score(q,d) is high enough (prerequisite: q matches d)
– Requires maintaining the min score per result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
• Reduces FPs, increases FNs
[Figure: the incoming document "President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico" shown against cached top-10 result lists (URL, score) for the queries "Boston Celtics", "Barack Obama", "World Cup", "L.A. Lakers", "Oil Spill" and "home", annotated with projected scores Score=0.503 and Score=0.681.]
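Score thresholding adds one comparison on top of matching: the projected score of the new document against the minimum score kept for the cached result set. In this sketch the `CachedEntry` type, the toy scoring function and the use of `>=` for "high enough" are all assumptions; a real engine would plug in its own ranking function.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class CachedEntry:
    query_terms: FrozenSet[str]
    min_score: float  # lowest score among the cached results for this query


Scorer = Callable[[FrozenSet[str], FrozenSet[str]], float]


def toy_score(query_terms: FrozenSet[str], synopsis_terms: FrozenSet[str]) -> float:
    """Placeholder scorer: share of synopsis terms that are query terms."""
    return len(query_terms & synopsis_terms) / len(synopsis_terms)


def should_invalidate(entry: CachedEntry, synopsis_terms: FrozenSet[str],
                      score: Scorer = toy_score) -> bool:
    # Prerequisite: the query must match the synopsis at all.
    if not entry.query_terms <= synopsis_terms:
        return False
    # Invalidate only if the document could displace a cached result.
    return score(entry.query_terms, synopsis_terms) >= entry.min_score
```

A matching document whose projected score falls below `min_score` is left alone, which removes some false positives; if the projection is wrong, the entry goes stale, which is the added false-negative risk.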
CIP Policies – Synopsis Generation
• Full synopsis: entire document + all ranking attributes
• Idea: reduce synopsis by dropping stuff “unlikely” to affect scoring
– Less communication, but more prediction errors
• In this paper:
– transfer some fraction of top TF-IDF terms
– drop document revisions that didn’t “change much”
[Figure: the text "We the people of the United States, in Order to form a more perfect Union, . . ." reduced to the synopsis terms "Order", "People", "Perfect", "union".]
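Keeping a fraction η of the top TF-IDF terms can be sketched as below. The particular TF-IDF variant, the rounding rule (ceiling) and the alphabetical tie-break are assumptions of this sketch, and the document frequencies in the usage example are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def synopsis(tokens: List[str], doc_freq: Dict[str, int],
             n_docs: int, eta: float) -> List[str]:
    """Keep the top fraction eta of distinct terms, ranked by TF-IDF."""
    tf = Counter(tokens)

    def weight(term: str) -> float:
        # Term frequency times a smoothed inverse document frequency.
        return tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))

    ranked = sorted(tf, key=lambda t: (-weight(t), t))
    keep = math.ceil(eta * len(ranked))
    return ranked[:keep]


tokens = ("we the people of the united states in order to form "
          "a more perfect union").split()
# Hypothetical document frequencies in a 1000-document collection.
doc_freq = {"we": 900, "the": 990, "of": 980, "in": 970, "to": 960, "a": 950,
            "more": 500, "form": 400, "united": 200, "states": 200,
            "people": 40, "order": 50, "union": 30, "perfect": 20}
print(synopsis(tokens, doc_freq, n_docs=1000, eta=0.25))
# ['perfect', 'union', 'people', 'order']
```

With η = 1 the full term set is transferred; smaller values cut communication at the price of missing matches on the dropped terms.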
Experimental Setting #1
• Sandbox experiment – static cache containing fixed query set, controlled document/query dynamics (no interleaving)
• Data Source: en.wikipedia.org
– History span: 2006 – 2008
– 2.8 TB, > 1M pages
– Dominated by updates (> 95%)
• Query Source: Y! query log
– 2 days of queries with a click on Wikipedia (2.5 M)
– Sample of 10K queries (9234 unique) chosen uniformly at random
• Evaluation pattern:
– 120 single-day epochs (~4% change/day)
– The same 10K query batch at the end of each epoch
• Search library: Apache Lucene open-source library
CIP Parameters – Notation Summary
Symbol   Meaning                              Range
η        Fraction of top-terms in synopsis    0 … 1
δ        Revision modification threshold      0 … 1
1s       Score thresholding applied?          0/1
τ        Time-to-live (TTL) threshold         0 … ∞
Basic CIP: η = 1, δ = 0, τ = ∞, 1s = 0
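For experiments it can help to hold the four parameters in one validated record. The field names below are hypothetical spellings of η, δ, 1s and τ; the defaults reproduce the Basic CIP setting listed above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CIPParams:
    eta: float = 1.0                  # fraction of top terms kept in the synopsis
    delta: float = 0.0                # revision modification threshold
    score_thresholding: bool = False  # the 1s switch
    tau: float = math.inf             # time-to-live threshold

    def __post_init__(self) -> None:
        if not 0.0 <= self.eta <= 1.0:
            raise ValueError("eta must lie in [0, 1]")
        if not 0.0 <= self.delta <= 1.0:
            raise ValueError("delta must lie in [0, 1]")
        if self.tau < 0:
            raise ValueError("tau must be non-negative")


BASIC_CIP = CIPParams()  # eta = 1, delta = 0, tau = infinity, 1s = 0
```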
Baseline Comparison
Policy                                                                False Positives   False Negatives   Stale Rate
No invalidation (TTL τ=∞)                                             0                 0.108             0.768 (!)
No caching (TTL τ=0)                                                  0.892             0                 0
TTL τ=2                                                               0.446             0.054             0.055
TTL τ=5                                                               0.179             0.086             0.175
Basic CIP (full synopses, invalidate upon match, threshold=no, τ=∞)   0.679             0.001             0.008 (!)
CIP Effectiveness: varying 1s, τ, and η
[Figure: plot annotated "Shrinking synopsis" and "Growing TTL".]
CIP Effectiveness: varying 1s and δ
[Figure: plot annotated "Increasing revision threshold".]
Best-in-Class Picture 
??
Conclusions
• The problem of maintaining cached search results over incremental indexes is real, and under-explored
• We proposed the CIP framework for real-time search cache management
• We proposed an experimental setting for CIPs
• We demonstrated a simple CIP that significantly improves over prior art (TTL), and measured sensitivity to various parameters
Future Work 
• Analyze a real-world scenario (News) 
– More drastic update and query dynamics 
– More realistic implementation to measure cost overhead 
– Compare to dynamic TTL 
• Continue Improving CIPs 
– Better synopsis 
– Connections between corpus dynamics and query dynamics 
• Study relation between real-time caching with CIPs and 
pre-fetching of results
Thank you! Questions? 
Policy Stability: Curbing Stale Results
[Figure: plots annotated "Still growing", "stable", "Still growing but slowly" and "stable".]

SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Caching Search Engine Results over Incremental Indices

  • 1. Caching Search Engine Results over Incremental Indices
      Y! Research Barcelona: Roi Blanco, Flavio Junqueira, Luca Telloli (*), Hugo Zaragoza
      Y! Labs Haifa: Edward Bortnikov, Ronny Lempel
      (*) currently at Barcelona Supercomputing Center
  • 2. Overview
      • Caching (Background & Prior Art)
      • Cache Invalidation Predictors
      • Experimental Setup
      • Results
      • Conclusions and Future Work
  • 3. High Level Architecture of Search Engines
      [Figure: architecture diagram. Labeled components: runtime system, cache, query engine, index, indexing pipeline, parser/tokenizer, WWW. Labeled flows: queries, results, terms.]
  • 4. Web Search Results Caching
      Caching of Web Search Results is crucial:
      • Query stream is extremely REDUNDANT and BURSTY
        – Zipfian distribution of query popularity (redundant)
        – Extreme trending of topics (bursty)
      • CACHE: {q} → Search_Results(q)
      • Caching benefits:
        – Shorten the engine’s response time (user waiting)
        – Lower the number/cost of query executions (# data centers)
      • Caveat: data (pages) is constantly changing!
  • 5. Caching Search Engine Results – Prior Art
      • Markatos, 2001: applied classical replacement policies (LRU, SLRU) to a 1M query log from Excite; demonstrated hit rates of ~30%
      • Replacement policies tailored for search engines:
        – PDC: Probability Driven Cache (Lempel & Moran, 2003)
        – SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and Perego, 2003)
        – AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras and Witschel, 2007)
      • Other observations and approaches:
        – Lempel & Moran, 2004: theoretical study via competitive analysis
        – Gan & Suel, 2009: optimizing query evaluation work rather than hit rates
        – Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and Bridge, 2010: refreshing aged entries during times of low back-end load
  • 6. Traditional View
      • Dilemma: Freshness versus Computation
      • Extreme #1: do not cache at all – evaluate all queries
        – 100% fresh results, lots of redundant evaluations
      • Extreme #2: never invalidate the cache
        – A majority of stale results – results refreshed only due to cache replacement, no redundant work
      • Middle ground: invalidate periodically (TTL)
        – A time-to-live parameter is applied to each cached entry
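The TTL middle ground on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system evaluated in the talk: the class name `TTLResultCache`, the `evaluate` callback standing in for the search engine, and the injectable `clock` are all hypothetical, and capacity-based replacement is left out.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLResultCache:
    """Query-result cache whose entries expire `ttl` time units after insertion."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so that tests can control time
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.evaluations = 0  # number of times the engine was actually invoked

    def get(self, query: str, evaluate: Callable[[str], List[str]]) -> List[str]:
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still within its time-to-live: serve cached results
        results = evaluate(query)  # miss or expired entry: re-evaluate the query
        self.evaluations += 1
        self._entries[query] = (now, results)
        return results


# Hypothetical usage with a fake clock and a stand-in engine.
fake_now = [0.0]
cache = TTLResultCache(ttl=2.0, clock=lambda: fake_now[0])
engine = lambda q: [f"{q} @ t={fake_now[0]}"]
cache.get("oil spill", engine)  # evaluated by the engine
cache.get("oil spill", engine)  # served from the cache
fake_now[0] = 3.0
cache.get("oil spill", engine)  # expired, evaluated again
```

In this sketch `ttl=0` never serves from the cache (extreme #1), while `ttl=float("inf")` never expires an entry (extreme #2).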
  • 7. Caching in the Presence of Index Changes
      • Increasing importance of freshness in search:
        – News, Blogs, Twitter, Social, Reviews, Local…
      • Moving towards “Real Time Crawling”:
        – Latency measured in seconds instead of hours or days.
      • Caching, by definition, returns OLD results
        – Traditionally, as TTL → 0, caching hit rate → 0
      • Can we have our cake and eat it too?
        – Can a cache operate on a very fast-changing collection?
  • 8. Cache Invalidation Predictors
      • Main idea:
        – Not ALL the documents change ALL the time.
        – When a document changes (or is created / deleted), remove from the cache any queries that may have returned it.
        – E.g., a new document on Spanish cooking arrives; no need to invalidate queries about quantum physics!
      • Cache Invalidation Predictors (CIP):
        1. Capture document insertions/updates as they enter the index
        2. Using document features, predict which cached search results will be affected by the updates
        3. Invalidate those cached entries (equivalent to eviction)
        4. Upon document deletion, invalidate any cached entry containing that document
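The four CIP steps above can be sketched as a cache that reacts to index events. This is a hedged illustration, not the authors' implementation: `InvalidatingCache`, `on_upsert`, `on_delete` and the default predictor `all_terms_match` (a document "may affect" a query when it contains every query term) are names and choices assumed here for the example.

```python
from typing import Callable, Dict, List, Set

Predictor = Callable[[str, Set[str]], bool]


def all_terms_match(query: str, doc_terms: Set[str]) -> bool:
    """Assumed predictor: the document contains every term of the query."""
    return set(query.lower().split()) <= doc_terms


class InvalidatingCache:
    """Result cache whose entries are evicted when a predictor flags them."""

    def __init__(self, predictor: Predictor = all_terms_match):
        self.predictor = predictor
        self.entries: Dict[str, List[str]] = {}  # query -> cached result doc ids

    def put(self, query: str, doc_ids: List[str]) -> None:
        self.entries[query] = doc_ids

    def on_upsert(self, doc_terms: Set[str]) -> List[str]:
        """Steps 1-3: on an insertion/update, evict entries predicted to be affected."""
        affected = [q for q in self.entries if self.predictor(q, doc_terms)]
        for q in affected:
            del self.entries[q]
        return affected

    def on_delete(self, doc_id: str) -> List[str]:
        """Step 4: on a deletion, evict every entry whose results contain the document."""
        affected = [q for q, docs in self.entries.items() if doc_id in docs]
        for q in affected:
            del self.entries[q]
        return affected


# Hypothetical usage mirroring the slide's example.
cache = InvalidatingCache()
cache.put("spanish cooking", ["d1", "d2"])
cache.put("quantum physics", ["d7", "d9"])
evicted = cache.on_upsert({"spanish", "cooking", "paella", "recipes"})
```

The linear scan over cached queries keeps the sketch short; it would not scale to a large cache.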
  • 9. Cache Invalidation Predictor Architecture
      [Figure: CIP architecture diagram. Labeled components: runtime system, cache, query engine, CIP, index, indexing pipeline, parser/tokenizer, synopsis generator. Labeled flows: queries, terms. Legend: data flow, API calls.]
  • 10. The Invalidator: Brief Implementation Notes
      • Given a synopsis (e.g., a document), the CIP needs to quickly locate the cached entries it matches
      • Essentially a reversed search engine:
        – The synopses are the queries
        – The queries (whose results are cached) are the documents
      • (!) Non-negligible cost of communication, indexing and querying.
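One way to picture the "reversed search engine" is an inverted index built over the cached queries and probed with the terms of each incoming synopsis. The sketch below assumes conjunctive matching (a query matches when all of its terms occur in the synopsis) and whitespace tokenization; `QueryIndex` and its methods are hypothetical names, not part of any system described in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class QueryIndex:
    """Inverted index whose 'documents' are the cached queries."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)  # term -> queries
        self._num_terms: Dict[str, int] = {}  # query -> number of distinct terms

    def add(self, query: str) -> None:
        terms = set(query.lower().split())
        self._num_terms[query] = len(terms)
        for term in terms:
            self._postings[term].add(query)

    def remove(self, query: str) -> None:
        for term in set(query.lower().split()):
            self._postings[term].discard(query)
        self._num_terms.pop(query, None)

    def match(self, synopsis_terms: Iterable[str]) -> Set[str]:
        """Return the cached queries whose terms all occur in the synopsis."""
        hits: Dict[str, int] = defaultdict(int)
        for term in set(synopsis_terms):
            for query in self._postings.get(term, ()):
                hits[query] += 1
        return {q for q, n in hits.items() if n == self._num_terms[q]}


# Hypothetical usage: index two cached queries, then probe with a synopsis.
index = QueryIndex()
index.add("spanish cooking")
index.add("quantum physics")
matched = index.match("a new book on spanish regional cooking".split())
```

Only the postings of terms present in the synopsis are touched, so the cost of a probe depends on the synopsis size rather than on the number of cached queries.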
  • 11. Some Definitions
      • At any given time, a query q in the cache may be:
        – Stale: the cache entry no longer represents the results the engine would return for q
        – or not stale.
      • Stale Rate: proportion of queries for which the search engine returns stale results
      • (Both computable, given enough computing time!)
  • 12. Some Definitions (2)
      • At any given time, a CIP may invalidate a query or not:
        – True Positive: invalidation of a stale query
        – True Negative: non-invalidation of a non-stale query
        – False Positive: invalidation of a non-stale query ($)
        – False Negative: non-invalidation of a stale query
      • False Negatives are much more expensive than False Positives:
        – User dissatisfaction vs. computational time
        – Frequency of query!
        – Errors spread forward in time!
      • True Negatives lead to huge savings (× query volume)
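The four outcomes above can be tallied mechanically once the ground truth (is the entry stale?) and the CIP decision (was it invalidated?) are known for each cached query. The following is a small sketch under that assumption; the function names and the choice to normalize by the total number of decisions are illustrative, not taken from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_counts(decisions: Iterable[Tuple[bool, bool]]) -> Counter:
    """Tally (is_stale, was_invalidated) pairs into TP / TN / FP / FN counts."""
    counts = Counter(TP=0, TN=0, FP=0, FN=0)
    for is_stale, was_invalidated in decisions:
        if was_invalidated:
            counts["TP" if is_stale else "FP"] += 1
        else:
            counts["FN" if is_stale else "TN"] += 1
    return counts


def ratios(counts: Counter) -> Dict[str, float]:
    """Express each outcome as a fraction of all decisions (assumed normalization)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


# Hypothetical epoch with five cached queries.
epoch = [(True, True), (False, False), (False, False), (False, True), (True, False)]
counts = confusion_counts(epoch)
fractions = ratios(counts)
```

Invalidating every entry ("no caching") drives the False Negative count to zero at the price of a False Positive for every non-stale entry, which is the trade-off the slide describes.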
  • 13. Invalidation Policies – Upon Match
      • Upon match: invalidate query q whenever the synopsis of document d matches q
        – E.g., for conjunctive queries, q ⊆ d
        – E.g., for a BOW engine, stale rate = 0
      • Very low stale rate
      • High FP! ($$)
      [Figure: example incoming document “The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals”, shown against six cached queries (Oil Spill, home, Boston Celtics, Barack Obama, World Cup, L.A. Lakers), each with its top-10 result URLs and scores.]
  • 14. Invalidation Policies – Score Thresholding
      • Score Thresholding: invalidate q whenever the projected score(q,d) is high enough (prerequisite: q matches d)
      • Requires maintaining the min score per result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
      • Reduces FPs, increases FNs
      [Figure: example incoming document “President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico”, annotated with projected scores 0.503 and 0.681, shown against six cached queries (Oil Spill, home, Boston Celtics, Barack Obama, World Cup, L.A. Lakers), each with its top-10 result URLs and scores.]
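A minimal sketch of the score-thresholding rule follows. It assumes that "high enough" means the projected score strictly exceeds the lowest score in the cached result set, and that a stand-alone scoring function is supplied by the caller; `CachedEntry`, `should_invalidate` and `project_score` are hypothetical names, and the pairing of scores with queries in the usage lines is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class CachedEntry:
    query: str
    min_score: float  # lowest score among the cached results for this query


def should_invalidate(
    entry: CachedEntry,
    doc_terms: Set[str],
    project_score: Callable[[str, Set[str]], float],
) -> bool:
    """Invalidate only if q matches d AND d's projected score could enter the results."""
    query_terms = set(entry.query.lower().split())
    if not query_terms <= doc_terms:  # prerequisite: q matches d (conjunctive)
        return False
    # Assumed rule: the document must beat the weakest cached result.
    return project_score(entry.query, doc_terms) > entry.min_score


# Hypothetical usage; the scorer and the numbers are stand-ins.
doc = "President Barack Obama criticized BP yesterday for mishandling the oil spill"
terms = set(doc.lower().split())
scorer = lambda query, doc_terms: 0.503  # stand-in for the engine's ranking function
weak_entry = CachedEntry("oil spill", min_score=0.375)
strong_entry = CachedEntry("barack obama", min_score=0.631)
decisions = [should_invalidate(e, terms, scorer) for e in (weak_entry, strong_entry)]
```

Entries whose weakest result already outscores the new document are kept, which is how the policy trades False Positives for possible False Negatives.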
  • 15. CIP Policies – Synopsis Generation
      • Full synopsis: entire document + all ranking attributes
      • Idea: reduce synopsis by dropping stuff “unlikely” to affect scoring
        – Less communication, but more prediction errors
      • In this paper:
        – transfer some fraction of top TF-IDF terms
        – drop document revisions that didn’t “change much”
      [Figure: example document “We the people of the United States, in Order to form a more perfect Union, . . .” reduced to the synopsis terms Order, People, Perfect, Union.]
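Both reductions named on this slide can be sketched briefly. The exact TF-IDF weighting and the measure of how much a revision "changed" are not given in the slides, so the code below assumes a smoothed IDF, `log((N + 1) / (df + 1))`, and a Jaccard distance between term sets; `synopsis`, `changed_enough`, `eta` and `delta` are illustrative names.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def synopsis(tokens: List[str], doc_freq: Dict[str, int], num_docs: int, eta: float) -> List[str]:
    """Keep the top `eta` fraction of a document's distinct terms, ranked by TF-IDF."""
    tf = Counter(tokens)

    def tfidf(term: str) -> float:
        # Assumed smoothed IDF; any monotone variant would serve the sketch.
        return tf[term] * math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))

    ranked = sorted(tf, key=lambda t: (-tfidf(t), t))  # ties broken alphabetically
    return ranked[: math.ceil(eta * len(ranked))]


def changed_enough(old_terms: Set[str], new_terms: Set[str], delta: float) -> bool:
    """Assumed revision filter: forward a revision only if its Jaccard distance
    from the previous revision exceeds `delta`."""
    union = old_terms | new_terms
    if not union:
        return False
    distance = 1.0 - len(old_terms & new_terms) / len(union)
    return distance > delta


# Hypothetical usage with made-up document frequencies over 1000 documents.
tokens = "we the people of the united states in order to form a more perfect union".split()
doc_freq = {"we": 900, "the": 990, "of": 980, "in": 970, "to": 985, "a": 995,
            "more": 700, "form": 400, "united": 300, "states": 300,
            "people": 200, "order": 100, "perfect": 50, "union": 60}
top_terms = synopsis(tokens, doc_freq, num_docs=1000, eta=0.25)
```

With `eta=1` the synopsis keeps every distinct term; smaller values send less data to the invalidator at the risk of missing a matching query.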
  • 16. Experimental Setting #1
      • Sandbox experiment – static cache containing a fixed query set, controlled document/query dynamics (no interleaving)
      • Data Source: en.wikipedia.org
        – History span: 2006 – 2008
        – 2.8 TB, > 1M pages
        – Dominated by updates (> 95%)
      • Query Source: Y! query log
        – 2 days of queries with a click on Wikipedia (2.5 M)
        – Sample of 10K queries (9234 unique) chosen uniformly at random
      • Evaluation pattern:
        – 120 single-day epochs (~4% change/day)
        – The same 10K query batch at the end of each epoch
      • Search library: Apache Lucene open-source library
  • 17. CIP Parameters – Notation Summary
      Symbol   Meaning                              Range
      η        Fraction of top-terms in synopsis    0 … 1
      δ        Revision modification threshold      0 … 1
      1s       Score thresholding applied?          0/1
      τ        Time-to-live (TTL) threshold         0 … ∞
      Basic CIP: η = 1, δ = 0, τ = ∞, 1s = 0
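For readers who want to carry this notation into code, the four parameters can be grouped into one small configuration object. The sketch below is an assumption about how one might do that; the class and field names are hypothetical, while the ranges and the Basic CIP defaults are the ones listed on the slide.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CIPParams:
    eta: float = 1.0                  # η: fraction of top terms kept in the synopsis, 0..1
    delta: float = 0.0                # δ: revision modification threshold, 0..1
    score_thresholding: bool = False  # 1s: is score thresholding applied?
    ttl: float = math.inf             # τ: time-to-live threshold, 0..∞

    def __post_init__(self) -> None:
        # Reject values outside the ranges given in the notation summary.
        if not 0.0 <= self.eta <= 1.0:
            raise ValueError("eta must lie in [0, 1]")
        if not 0.0 <= self.delta <= 1.0:
            raise ValueError("delta must lie in [0, 1]")
        if self.ttl < 0:
            raise ValueError("ttl must be non-negative")


# Basic CIP: η = 1, δ = 0, τ = ∞, 1s = 0
BASIC_CIP = CIPParams()
```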
  • 18. Baseline Comparison
      Policy                       False Positives   False Negatives   Stale Rate
      No invalidation (TTL τ=∞)    0                 0.108             0.768 (!)
      No caching (TTL τ=0)         0.892             0                 0
      TTL τ=2                      0.446             0.054             0.055
      TTL τ=5                      0.179             0.086             0.175
      Basic CIP                    0.679             0.001             0.008 (!)
      Basic CIP: full synopses, invalidate upon match, threshold=no, τ=∞
  • 19. CIP Effectiveness: varying 1s, τ, and η
      [Figure: plot with annotations “Shrinking synopsis” and “Growing TTL”.]
  • 20. CIP Effectiveness: varying 1s and δ
      [Figure: plot with annotation “Increasing revision threshold”.]
  • 21. Best-in-Class Picture ??
  • 22. Conclusions
      • The problem of maintaining cached search results over incremental indexes is real, and under-explored
      • We proposed the CIP framework for real-time search cache management
      • We proposed an experimental setting for CIPs
      • We demonstrated a simple CIP that significantly improves over prior art (TTL), and measured its sensitivity to various parameters
  • 23. Future Work
      • Analyze a real-world scenario (News)
        – More drastic update and query dynamics
        – More realistic implementation to measure cost overhead
        – Compare to dynamic TTL
      • Continue Improving CIPs
        – Better synopsis
        – Connections between corpus dynamics and query dynamics
      • Study relation between real-time caching with CIPs and pre-fetching of results
  • 25. Policy Stability: Curbing Stale Results
      [Figure: plots with annotations “Still growing”, “stable”, “Still growing but slowly”, “stable”.]