A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and, in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to accurately predicting which queries will produce different results if re-evaluated, given the actual changes to the index.
To address this problem, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents, using a query log from the Yahoo! search engine, shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to the baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results than a TTL-only scheme, for similar freshness of results.
Caching Search Engine Results over Incremental Indices
1. Caching Search Engine Results over Incremental Indices
Roi Blanco, Flavio Junqueira, Luca Telloli*, Hugo Zaragoza – Yahoo! Research Barcelona
Edward Bortnikov, Ronny Lempel – Yahoo! Labs Haifa
* currently at Barcelona Supercomputing Center
3. High-Level Architecture of Search Engines
[Diagram: the indexing pipeline (Web pages → Parser/Tokenizer → terms → Index) feeds the runtime system; incoming queries are answered from a results cache sitting in front of the query engine.]
4. Web Search Results Caching
Caching of Web search results is crucial:
• The query stream is extremely REDUNDANT and BURSTY
– Zipfian distribution of query popularity (redundant)
– Extreme trending of topics (bursty)
• CACHE: q → Search_Results(q)
• Caching benefits:
– Shortens the engine’s response time (user waiting)
– Lowers the number/cost of query executions (# data centers)
• Caveat: the data (pages) is constantly changing!
5. Caching Search Engine Results – Prior Art
• Markatos, 2001: applied classical replacement policies (LRU, SLRU) to a 1M-query log from Excite; demonstrated hit rates of ~30%
• Replacement policies tailored for search engines:
– PDC: Probability Driven Cache (Lempel & Moran, 2003)
– SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and Perego, 2003)
– AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras and Witschel, 2007)
• Other observations and approaches:
– Lempel & Moran, 2004: theoretical study via competitive analysis
– Gan & Suel, 2009: optimizing query evaluation work rather than hit rates
– Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and Bridge, 2010: refreshing aged entries during times of low back-end load
6. Dilemma: Freshness versus Computation
Traditional view:
Extreme #1: do not cache at all – evaluate all queries
100% fresh results, but lots of redundant evaluations
Extreme #2: never invalidate the cache
A majority of stale results – results refreshed only due to cache replacement, but no redundant work
Middle ground: invalidate periodically (TTL)
A time-to-live parameter is applied to each cached entry
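The TTL middle ground above can be sketched in a few lines. This is an illustrative Python sketch, not code from the talk; the class and method names are assumptions:

```python
import time

class TTLCache:
    """Minimal sketch of a TTL-based results cache (names illustrative)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # query -> (results, timestamp when stored)

    def put(self, query, results, now=None):
        now = time.time() if now is None else now
        self.entries[query] = (results, now)

    def get(self, query, now=None):
        """Return cached results, or None if absent or past its TTL."""
        now = time.time() if now is None else now
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, stored_at = hit
        if now - stored_at > self.ttl:   # entry outlived its time-to-live
            del self.entries[query]      # invalidate on expiry
            return None
        return results
```

Note that expiry is blind: the entry is evicted after `ttl_seconds` whether or not the underlying index actually changed, which is exactly the inefficiency the CIPs below target.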
7. Caching in the Presence of Index Changes
• Increasing importance of freshness in search:
– News, Blogs, Twitter, Social, Reviews, Local…
• Moving towards “Real-Time Crawling”:
– Latency measured in seconds instead of hours or days
• Caching, by definition, returns OLD results
– Traditionally, as TTL → 0, the cache hit rate → 0
• Can we have our cake and eat it too?
– Can a cache operate on a very fast-changing collection?
8. Cache Invalidation Predictors
Main idea:
Not ALL the documents change ALL the time.
When a document changes (or is created / deleted), remove from the cache any queries that may have returned it.
E.g., a new document on Spanish cooking arrives; no need to invalidate queries about quantum physics!
Cache Invalidation Predictors (CIP):
1. Capture document insertions/updates as they enter the index
2. Using document features, predict which cached search results will be affected by the updates
3. Invalidate those cached entries (equivalent to eviction)
4. Upon document deletion, invalidate any cached entry containing that document
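The four steps above could be sketched roughly as follows. All names are illustrative, and the `matches` predicate stands in for whichever prediction policy the CIP uses (the policies are detailed on later slides):

```python
class CacheInvalidationPredictor:
    """Sketch of the CIP loop described above (hypothetical API)."""

    def __init__(self, cache, matches):
        self.cache = cache      # dict: query -> cached result list (URLs)
        self.matches = matches  # predicate(synopsis, query) -> bool

    def on_document_update(self, synopsis):
        # Steps 1-3: as an insertion/update enters the index, predict
        # which cached queries it affects and evict exactly those.
        for query in list(self.cache):
            if self.matches(synopsis, query):
                del self.cache[query]

    def on_document_delete(self, doc_url):
        # Step 4: invalidate any cached entry containing the deleted doc.
        for query, results in list(self.cache.items()):
            if doc_url in results:
                del self.cache[query]
```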
9. Cache Invalidation Predictor Architecture
[Diagram: the search-engine architecture of slide 3 extended with a CIP. A synopsis generator in the indexing pipeline feeds document synopses to the CIP, which issues invalidation calls against the results cache; the legend distinguishes data flow from API calls.]
10. The Invalidator: Brief Implementation Notes
• The CIP needs to quickly locate, given a synopsis (e.g., a document), which cached entries it matches
• Essentially a reversed search engine:
– The synopses are the queries
– The queries (whose results are cached) are the documents
• (!) Non-negligible cost of communication, indexing and querying
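One way to realize this “reversed search engine”, sketched under the assumptions of whitespace-tokenized queries and conjunctive matching (helper names are illustrative, not from the talk):

```python
from collections import defaultdict

def build_query_index(cached_queries):
    """Index the cached queries by term, so an incoming synopsis can be
    matched against them quickly. Roles are reversed: the synopsis plays
    the query, the cached queries play the documents."""
    postings = defaultdict(set)
    for query in cached_queries:
        for term in query.split():
            postings[term].add(query)
    return postings

def matching_queries(postings, synopsis_terms):
    """Return cached queries to invalidate under conjunctive matching:
    every term of the query must appear in the synopsis."""
    candidates = set()
    for term in synopsis_terms:
        candidates |= postings.get(term, set())
    return {q for q in candidates if set(q.split()) <= synopsis_terms}
```

In practice this index lives alongside the cache and must be kept in sync with insertions and evictions, which is part of the non-negligible overhead noted above.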
11. Some Definitions
• At any given time, a query q in the cache may be:
– Stale: the cache entry no longer represents the results the engine would return for q
– or not stale
• Stale Rate: proportion of queries for which the search engine returns stale results
• (Both are computable, given enough computing time!)
12. Some Definitions (2)
At any given time, a CIP may invalidate a query or not:
• True Positive: invalidation of a stale query
• True Negative: non-invalidation of a non-stale query
• False Positive: invalidation of a non-stale query
• False Negative: non-invalidation of a stale query
False Negatives are much more expensive than False Positives (user dissatisfaction vs. computational time):
– the stale result is served repeatedly (frequency of the query)
– the error spreads forward in time
True Negatives lead to huge savings (× query volume)
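The four outcomes above can be tallied as in this illustrative sketch; under a CIP, the stale results actually served to users correspond to the false negatives (function and variable names are assumptions):

```python
def cip_confusion(invalidated, stale, cached_queries):
    """Classify each cached query by CIP decision vs. ground truth.
    `invalidated`: set of queries the CIP chose to invalidate.
    `stale`: set of queries whose results truly changed (ground truth)."""
    tp = sum(1 for q in cached_queries if q in invalidated and q in stale)
    fp = sum(1 for q in cached_queries if q in invalidated and q not in stale)
    fn = sum(1 for q in cached_queries if q not in invalidated and q in stale)
    tn = len(cached_queries) - tp - fp - fn
    # False negatives are the stale entries that stay cached and get served.
    stale_rate = fn / len(cached_queries)
    return tp, fp, fn, tn, stale_rate
```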
13. Invalidation Policies – Upon Match
Upon match: invalidate query q whenever the synopsis of document d matches q
E.g., for conjunctive queries, invalidate whenever q ⊆ d
(E.g., for a bag-of-words engine, this yields stale rate = 0)
Example: “The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals”
→ Very low stale rate, but high FP rate! $$
[Figure: cached top-10 result lists (URLs with scores) for the queries “Boston Celtics”, “Barack Obama”, “World Cup”, “L.A. Lakers”, “home” and “Oil Spill”; under upon-match, the arriving document invalidates every entry whose query terms it contains.]
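For a bag-of-words engine with conjunctive semantics, the upon-match test is simply term containment. A toy illustration of the policy and of the false positives it produces (not the paper’s implementation; terms are lowercased for simplicity):

```python
def upon_match(query, doc_terms):
    """Upon-match policy for conjunctive queries: invalidate q whenever
    all of q's terms occur in the document synopsis (q is a subset of d)."""
    return set(query.split()) <= doc_terms

# The NBA Finals example from the slide, as a bag of words:
doc = set("the boston celtics beat the la lakers on their "
          "home court in the 4th game of the 2010 nba finals".split())

assert upon_match("boston celtics", doc)   # invalidated, likely stale
assert upon_match("home", doc)             # also invalidated: a likely FP,
                                           # the doc may not crack its top-10
assert not upon_match("oil spill", doc)    # untouched, as desired
```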
14. Invalidation Policies – Score Thresholding
Score Thresholding: invalidate q whenever the projected score(q,d) is high enough (prerequisite: q matches d)
Requires maintaining the minimum score per cached result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
Reduces FPs, increases FNs
Example: “President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico” (projected scores, e.g., 0.503 and 0.681)
[Figure: cached top-10 result lists for the queries “Boston Celtics”, “Barack Obama”, “World Cup”, “L.A. Lakers”, “home” and “Oil Spill”; each matching query’s projected score is compared against the minimum score in its cached result set to decide invalidation.]
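A sketch of the thresholding decision, assuming a stand-alone `score` function and a map of per-query minimum cached scores; all names and the toy scorer are illustrative, not the paper’s ranking function:

```python
def should_invalidate(query, synopsis_terms, cache_min_scores, score):
    """Score-thresholding policy: given that the synopsis matches `query`,
    invalidate only if the document's projected score could displace an
    entry in the cached top-k, i.e. it beats the lowest cached score."""
    projected = score(query, synopsis_terms)
    return projected >= cache_min_scores[query]

def overlap_score(query, synopsis_terms):
    """Toy stand-in scorer: fraction of query terms present in the synopsis."""
    q_terms = set(query.split())
    return len(q_terms & synopsis_terms) / len(q_terms)
```

Compared with upon-match, a miss here (projected score under-estimated) becomes a false negative, which is why the slide notes that thresholding reduces FPs but increases FNs.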
15. CIP Policies – Synopsis Generation
Full synopsis: the entire document + all ranking attributes
Idea: reduce the synopsis by dropping content “unlikely” to affect scoring
Less communication, but more prediction errors
In this paper:
– transfer some fraction of the top TF-IDF terms
– drop document revisions that didn’t “change much”
[Figure: the preamble “We the people of the United States, in Order to form a more perfect Union, …” reduced to its top TF-IDF terms: order, people, perfect, union.]
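The top-TF-IDF reduction could look roughly like this; the fraction kept and the exact TF-IDF formula are assumptions for illustration, not the paper’s precise choices:

```python
import math
from collections import Counter

def make_synopsis(doc_terms, doc_freq, num_docs, keep_fraction=0.25):
    """Reduced synopsis: keep only the top fraction of terms by TF-IDF.
    `doc_freq` maps term -> number of documents containing it (assumed
    available from the index). Smaller synopses cost less to ship to the
    CIP, at the price of more prediction errors."""
    tf = Counter(doc_terms)
    def tfidf(term):
        return tf[term] * math.log(num_docs / (1 + doc_freq.get(term, 0)))
    ranked = sorted(tf, key=tfidf, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:keep])
```

Common, low-IDF words (“the”, “of”) drop out first, mirroring the slide’s example where only content terms like “union” and “order” survive.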
16. Experimental Setting #1
Sandbox experiment – a static cache containing a fixed query set, with controlled document/query dynamics (no interleaving)
Data source: en.wikipedia.org
– History span: 2006–2008
– 2.8 TB, > 1M pages
– Dominated by updates (> 95%)
Query source: Yahoo! query log
– 2 days of queries with a click on Wikipedia (2.5M)
– Sample of 10K queries (9,234 unique) chosen uniformly at random
Evaluation pattern:
– 120 single-day epochs (~4% change/day)
– The same 10K query batch at the end of each epoch
Search library: Apache Lucene (open source)
22. Conclusions
• The problem of maintaining cached search results over incremental indices is real, and under-explored
• We proposed the CIP framework for real-time search cache management
• We proposed an experimental setting for CIPs
• We demonstrated a simple CIP that significantly improves over the prior art (TTL), and measured its sensitivity to various parameters
23. Future Work
Future Work
• Analyze a real-world scenario (News)
– More drastic update and query dynamics
– More realistic implementation to measure cost overhead
– Compare to dynamic TTL
• Continue Improving CIPs
– Better synopsis
– Connections between corpus dynamics and query dynamics
• Study the relation between real-time caching with CIPs and pre-fetching of results