A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and, in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to accurately predicting which queries will produce different results if re-evaluated, given the actual changes to the index.
To address this problem, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents, using a query log from the Yahoo! search engine, shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to the baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results than a TTL-only scheme, for similar freshness of results.
Caching Search Engine Results over Incremental Indices
1. Caching Search Engine Results over Incremental Indices
Roi Blanco, Flavio Junqueira, Luca Telloli*, Hugo Zaragoza – Yahoo! Research Barcelona
Edward Bortnikov, Ronny Lempel – Yahoo! Labs Haifa
* currently at Barcelona Supercomputing Center
3. High-Level Architecture of Search Engines
[Diagram: the indexing pipeline (Web pages → Parser/Tokenizer → terms → Index) feeds the runtime system; incoming queries are answered from a results cache sitting in front of the query engine.]
4. Web Search Results Caching
Caching of Web search results is crucial:
• The query stream is extremely REDUNDANT and BURSTY
– Zipfian distribution of query popularity (redundant)
– Extreme trending of topics (bursty)
• CACHE: q → Search_Results(q)
• Caching benefits:
– Shortens the engine’s response time (user waiting)
– Lowers the number/cost of query executions (# data centers)
• Caveat: the data (pages) is constantly changing!
5. Caching Search Engine Results – Prior Art
• Markatos, 2001: applied classical replacement policies (LRU, SLRU) to a 1M-query log from Excite; demonstrated hit rates of ~30%
• Replacement policies tailored for search engines:
– PDC: Probability Driven Cache (Lempel & Moran, 2003)
– SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and Perego, 2003)
– AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras and Witschel, 2007)
• Other observations and approaches:
– Lempel & Moran, 2004: theoretical study via competitive analysis
– Gan & Suel, 2009: optimizing query evaluation work rather than hit rates
– Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and Bridge, 2010: refreshing aged entries during times of low back-end load
6. Dilemma: Freshness versus Computation
Traditional view:
Extreme #1: do not cache at all – evaluate all queries
100% fresh results, but lots of redundant evaluations
Extreme #2: never invalidate the cache
A majority of stale results – results refreshed only due to cache replacement, but no redundant work
Middle ground: invalidate periodically (TTL)
A time-to-live parameter is applied to each cached entry
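The TTL middle ground above can be sketched in a few lines. This is an illustrative Python sketch, not code from the talk; the class and method names are assumptions:

```python
import time

class TTLCache:
    """Minimal sketch of a TTL-based results cache (names illustrative)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # query -> (results, timestamp when stored)

    def put(self, query, results, now=None):
        now = time.time() if now is None else now
        self.entries[query] = (results, now)

    def get(self, query, now=None):
        """Return cached results, or None if absent or past its TTL."""
        now = time.time() if now is None else now
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, stored_at = hit
        if now - stored_at > self.ttl:   # entry outlived its time-to-live
            del self.entries[query]      # invalidate on expiry
            return None
        return results
```

Note that expiry is blind: the entry is evicted after `ttl_seconds` whether or not the underlying index actually changed, which is exactly the inefficiency the CIPs below target.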
7. Caching in the Presence of Index Changes
• Increasing importance of freshness in search:
– News, Blogs, Twitter, Social, Reviews, Local…
• Moving towards “Real-Time Crawling”:
– Latency measured in seconds instead of hours or days
• Caching, by definition, returns OLD results
– Traditionally, as TTL → 0, the cache hit rate → 0
• Can we have our cake and eat it too?
– Can a cache operate on a very fast-changing collection?
8. Cache Invalidation Predictors
Main idea:
Not ALL the documents change ALL the time.
When a document changes (or is created / deleted), remove from the cache any queries that may have returned it.
E.g., a new document on Spanish cooking arrives; no need to invalidate queries about quantum physics!
Cache Invalidation Predictors (CIP):
1. Capture document insertions/updates as they enter the index
2. Using document features, predict which cached search results will be affected by the updates
3. Invalidate those cached entries (equivalent to eviction)
4. Upon document deletion, invalidate any cached entry containing that document
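The four steps above could be sketched roughly as follows. All names are illustrative, and the `matches` predicate stands in for whichever prediction policy the CIP uses (the policies are detailed on later slides):

```python
class CacheInvalidationPredictor:
    """Sketch of the CIP loop described above (hypothetical API)."""

    def __init__(self, cache, matches):
        self.cache = cache      # dict: query -> cached result list (URLs)
        self.matches = matches  # predicate(synopsis, query) -> bool

    def on_document_update(self, synopsis):
        # Steps 1-3: as an insertion/update enters the index, predict
        # which cached queries it affects and evict exactly those.
        for query in list(self.cache):
            if self.matches(synopsis, query):
                del self.cache[query]

    def on_document_delete(self, doc_url):
        # Step 4: invalidate any cached entry containing the deleted doc.
        for query, results in list(self.cache.items()):
            if doc_url in results:
                del self.cache[query]
```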
9. Cache Invalidation Predictor Architecture
[Diagram: the search-engine architecture of slide 3 extended with a CIP. A synopsis generator in the indexing pipeline feeds document synopses to the CIP, which issues invalidation calls against the results cache; the legend distinguishes data flow from API calls.]
10. The Invalidator: Brief Implementation Notes
• The CIP needs to quickly locate, given a synopsis (e.g., a document), which cached entries it matches
• Essentially a reversed search engine:
– The synopses are the queries
– The queries (whose results are cached) are the documents
• (!) Non-negligible cost of communication, indexing and querying
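One way to realize this “reversed search engine”, sketched under the assumptions of whitespace-tokenized queries and conjunctive matching (helper names are illustrative, not from the talk):

```python
from collections import defaultdict

def build_query_index(cached_queries):
    """Index the cached queries by term, so an incoming synopsis can be
    matched against them quickly. Roles are reversed: the synopsis plays
    the query, the cached queries play the documents."""
    postings = defaultdict(set)
    for query in cached_queries:
        for term in query.split():
            postings[term].add(query)
    return postings

def matching_queries(postings, synopsis_terms):
    """Return cached queries to invalidate under conjunctive matching:
    every term of the query must appear in the synopsis."""
    candidates = set()
    for term in synopsis_terms:
        candidates |= postings.get(term, set())
    return {q for q in candidates if set(q.split()) <= synopsis_terms}
```

In practice this index lives alongside the cache and must be kept in sync with insertions and evictions, which is part of the non-negligible overhead noted above.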
11. Some Definitions
• At any given time, a query q in the cache may be:
– Stale: the cache entry no longer represents the results the engine would return for q
– or not stale
• Stale Rate: proportion of queries for which the search engine returns stale results
• (Both are computable, given enough computing time!)
12. Some Definitions (2)
At any given time, a CIP may invalidate a query or not:
• True Positive: invalidation of a stale query
• True Negative: non-invalidation of a non-stale query
• False Positive: invalidation of a non-stale query
• False Negative: non-invalidation of a stale query
False Negatives are much more expensive than False Positives (user dissatisfaction vs. computational time):
– the stale result is served repeatedly (frequency of the query)
– the error spreads forward in time
True Negatives lead to huge savings (× query volume)
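The four outcomes above can be tallied as in this illustrative sketch; under a CIP, the stale results actually served to users correspond to the false negatives (function and variable names are assumptions):

```python
def cip_confusion(invalidated, stale, cached_queries):
    """Classify each cached query by CIP decision vs. ground truth.
    `invalidated`: set of queries the CIP chose to invalidate.
    `stale`: set of queries whose results truly changed (ground truth)."""
    tp = sum(1 for q in cached_queries if q in invalidated and q in stale)
    fp = sum(1 for q in cached_queries if q in invalidated and q not in stale)
    fn = sum(1 for q in cached_queries if q not in invalidated and q in stale)
    tn = len(cached_queries) - tp - fp - fn
    # False negatives are the stale entries that stay cached and get served.
    stale_rate = fn / len(cached_queries)
    return tp, fp, fn, tn, stale_rate
```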
13. Invalidation Policies – Upon Match
Upon match: invalidate query q whenever the synopsis of document d matches q
E.g., for conjunctive queries, invalidate whenever q ⊆ d
(E.g., for a bag-of-words engine, this yields stale rate = 0)
Example: “The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals”
→ Very low stale rate, but high FP rate! $$
[Figure: cached top-10 result lists (URLs with scores) for the queries “Boston Celtics”, “Barack Obama”, “World Cup”, “L.A. Lakers”, “home” and “Oil Spill”; under upon-match, the arriving document invalidates every entry whose query terms it contains.]
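For a bag-of-words engine with conjunctive semantics, the upon-match test is simply term containment. A toy illustration of the policy and of the false positives it produces (not the paper’s implementation; terms are lowercased for simplicity):

```python
def upon_match(query, doc_terms):
    """Upon-match policy for conjunctive queries: invalidate q whenever
    all of q's terms occur in the document synopsis (q is a subset of d)."""
    return set(query.split()) <= doc_terms

# The NBA Finals example from the slide, as a bag of words:
doc = set("the boston celtics beat the la lakers on their "
          "home court in the 4th game of the 2010 nba finals".split())

assert upon_match("boston celtics", doc)   # invalidated, likely stale
assert upon_match("home", doc)             # also invalidated: a likely FP,
                                           # the doc may not crack its top-10
assert not upon_match("oil spill", doc)    # untouched, as desired
```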
14. Invalidation Policies – Score Thresholding
Score Thresholding: invalidate q whenever the projected score(q,d) is high enough (prerequisite: q matches d)
Requires maintaining the minimum score per cached result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
Reduces FPs, increases FNs
Example: “President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico” (projected scores, e.g., 0.503 and 0.681)
[Figure: cached top-10 result lists for the queries “Boston Celtics”, “Barack Obama”, “World Cup”, “L.A. Lakers”, “home” and “Oil Spill”; each matching query’s projected score is compared against the minimum score in its cached result set to decide invalidation.]
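A sketch of the thresholding decision, assuming a stand-alone `score` function and a map of per-query minimum cached scores; all names and the toy scorer are illustrative, not the paper’s ranking function:

```python
def should_invalidate(query, synopsis_terms, cache_min_scores, score):
    """Score-thresholding policy: given that the synopsis matches `query`,
    invalidate only if the document's projected score could displace an
    entry in the cached top-k, i.e. it beats the lowest cached score."""
    projected = score(query, synopsis_terms)
    return projected >= cache_min_scores[query]

def overlap_score(query, synopsis_terms):
    """Toy stand-in scorer: fraction of query terms present in the synopsis."""
    q_terms = set(query.split())
    return len(q_terms & synopsis_terms) / len(q_terms)
```

Compared with upon-match, a miss here (projected score under-estimated) becomes a false negative, which is why the slide notes that thresholding reduces FPs but increases FNs.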
15. CIP Policies – Synopsis Generation
Full synopsis: the entire document + all ranking attributes
Idea: reduce the synopsis by dropping content “unlikely” to affect scoring
Less communication, but more prediction errors
In this paper:
– transfer some fraction of the top TF-IDF terms
– drop document revisions that didn’t “change much”
[Figure: the preamble “We the people of the United States, in Order to form a more perfect Union, …” reduced to its top TF-IDF terms: order, people, perfect, union.]
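The top-TF-IDF reduction could look roughly like this; the fraction kept and the exact TF-IDF formula are assumptions for illustration, not the paper’s precise choices:

```python
import math
from collections import Counter

def make_synopsis(doc_terms, doc_freq, num_docs, keep_fraction=0.25):
    """Reduced synopsis: keep only the top fraction of terms by TF-IDF.
    `doc_freq` maps term -> number of documents containing it (assumed
    available from the index). Smaller synopses cost less to ship to the
    CIP, at the price of more prediction errors."""
    tf = Counter(doc_terms)
    def tfidf(term):
        return tf[term] * math.log(num_docs / (1 + doc_freq.get(term, 0)))
    ranked = sorted(tf, key=tfidf, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:keep])
```

Common, low-IDF words (“the”, “of”) drop out first, mirroring the slide’s example where only content terms like “union” and “order” survive.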
16. Experimental Setting #1
Sandbox experiment – a static cache containing a fixed query set, with controlled document/query dynamics (no interleaving)
Data source: en.wikipedia.org
– History span: 2006–2008
– 2.8 TB, > 1M pages
– Dominated by updates (> 95%)
Query source: Yahoo! query log
– 2 days of queries with a click on Wikipedia (2.5M)
– Sample of 10K queries (9,234 unique) chosen uniformly at random
Evaluation pattern:
– 120 single-day epochs (~4% change/day)
– The same 10K query batch at the end of each epoch
Search library: Apache Lucene (open source)
22. Conclusions
• The problem of maintaining cached search results over incremental indices is real, and under-explored
• We proposed the CIP framework for real-time search cache management
• We proposed an experimental setting for CIPs
• We demonstrated a simple CIP that significantly improves over the prior art (TTL), and measured its sensitivity to various parameters
23. Future Work
Future Work
• Analyze a real-world scenario (News)
– More drastic update and query dynamics
– More realistic implementation to measure cost overhead
– Compare to dynamic TTL
• Continue Improving CIPs
– Better synopsis
– Connections between corpus dynamics and query dynamics
• Study the relation between real-time caching with CIPs and pre-fetching of results