This document discusses the engineering challenges of vertical search engines. It describes how vertical search differs from traditional search by querying structured data instead of free text. It outlines the challenges of retrieving web data at scale, processing unstructured data, building distributed data infrastructures, performing vertical search, conducting data analytics, and implementing computational advertising on structured datasets. The document emphasizes that vertical search at web scale provides many opportunities for research and engineering across computer science fields.
1. +
Engineering Challenges
in Vertical Search Engines
Aleksandar Bradic, Senior Director,
Engineering and R&D, Vast.com
2. +
Introduction
Vertical Search
Search focused on vertical data
Vertical Data – data inherently described by its structure:
Items/Properties for sale (Automotive, Real Estate..)
Geographical Data (Neighborhoods, Locations..)
Services (Hotels, Transportation..)
Businesses (Restaurants, Nightlife..)
Events (Concerts, Plays..)
Auction items (Collectibles, Art..)
Metadata (News, Social Data, Reviews..)
…
3. +
Introduction
Vertical Search != Full Text Search
Full Text Search queries:
“Cheap tickets for Broadway shows this week”
“Trendy Restaurants in San Francisco near SoMa”
“3-day trips from NYC to anywhere under $1000”
Vertical Search queries:
“price-sorted results below two standard deviations from tickets
category with Broadway as location and date range of 2010-04-11 to
2010-04-18”
“distance-sorted results relative to center of SF/SoMa matching the
appropriate threshold of composite score of user review scores and
historical change in query/review volume”
“total cost-sorted results for all 3-day intervals within next 6 months
combining hotel and airfare price below max value of $1000 for all
valid locations”
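A minimal sketch of the first structured query above, in Python. The `Listing` fields, the sample data, and the reading of "below two standard deviations" as "price under mean minus k·σ of the matching set" are assumptions made for illustration, not a description of any production query engine.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Listing:
    # Hypothetical record layout for one vertical ("tickets").
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Price-sorted listings priced more than k std devs below the mean."""
    matching = [
        x for x in listings
        if x.category == category
        and x.location == location
        and start <= x.event_date <= end
    ]
    if len(matching) < 2:
        return []
    prices = [x.price for x in matching]
    threshold = mean(prices) - k * pstdev(prices)
    return sorted((x for x in matching if x.price < threshold),
                  key=lambda x: x.price)


prices = [100, 105, 95, 110, 90, 102, 98, 104, 20]
sample = [Listing("tickets", "Broadway", date(2010, 4, 12), p) for p in prices]
# This listing falls outside the requested date range and is ignored.
sample.append(Listing("tickets", "Broadway", date(2010, 5, 1), 5))

result = cheap_listings(sample, "tickets", "Broadway",
                        date(2010, 4, 11), date(2010, 4, 18))
```

Every clause of the query maps to a field filter, a statistic over the matching subset, or a sort key, which is what separates it from a free-text query.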
4. +
Introduction
Vertical Search = search on structured data
Vertical Search at Web-Scale:
Web-Scale datasets
Web-Scale query volumes
Interactive operation
Low latency requirements
Utility maximization across all involved parties
=> loads of fun ! : )
6. +
@Vast.com
Daily processing of up to 1 TB of unstructured and semi-structured
Web data
Managing an operational dataset of ~150M records across multiple
verticals
Handling peak search loads of > 1000 queries/sec
We’re hiring ! : )
7. +
Challenges in Vertical Search
Engines
Web Data Retrieval
Unstructured Data
Data Processing Infrastructures
Vertical Search
Data Analytics
Computational Advertising
9. +
Web Data Retrieval
”Deep Web” crawling
Locating Deep Web Content Sources
Selecting Relevant Sources
Estimating Database Size
Understanding Content / Form Detection
Automatic Dispatch of HTML Forms
Predicting content in free text forms
Crawling non-HTML Content
Estimating Query Result Sparsity
URL Generation problem
Query Covering Problem
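One simple reading of the URL generation problem is enumerating GET submissions for a form whose candidate field values are already known. The sketch below assumes exactly that; the base URL and field names are hypothetical, and the hard part (choosing which combinations are worth fetching) is not addressed.

```python
from itertools import product
from urllib.parse import urlencode


def generate_urls(base_url, field_values):
    """Yield one URL per combination of candidate form-field values."""
    names = sorted(field_values)  # fixed order keeps output deterministic
    for combo in product(*(field_values[n] for n in names)):
        yield f"{base_url}?{urlencode(dict(zip(names, combo)))}"


urls = list(generate_urls(
    "https://example.com/search",            # hypothetical form endpoint
    {"make": ["ford", "honda"], "year": [2008, 2009]},
))
```

The number of URLs is the product of the value counts per field, so naive enumeration grows quickly with the number of form fields.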
11. +
Unstructured Data
Unstructured Data – information that does not have a pre-defined
data model
Handling Unstructured Data:
Data Cleaning
Tagging with Metadata
Vertical Classification
Schema Matching
Information Extraction
Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!
make model year trim price ???
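The easy fields in the post above can be pulled out with patterns alone. A minimal sketch, assuming US-style prices and four-digit years; the function name and patterns are illustrative only. Make, model and trim cannot be recovered reliably this way, since they require knowing the set of valid values.

```python
import re

POST = "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!"


def extract_year_and_price(text):
    """Extract a four-digit year and a dollar price, if present."""
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    price = re.search(r"\$\s?(\d[\d,]*)", text)
    return {
        "year": int(year.group(0)) if year else None,
        "price": int(price.group(1).replace(",", "")) if price else None,
    }


fields = extract_year_and_price(POST)
```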
12. +
Unstructured Data
Information extraction from unstructured, ungrammatical
data
Reference Sets - relational data sets that consist of a collection of
known entities with associated common attributes
Reference Set Selection
Reference Set Generation
Record Linkage : Finding the “best matching” member of the reference
set corresponding to a post
Challenge : Automatic Generation of Reference Sets
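Record linkage against a reference set can be sketched as a token-overlap match. The tiny `REFERENCE_SET` and the scoring rule (fraction of an entry's tokens found in the post) are assumptions chosen for brevity; real systems typically use richer similarity measures.

```python
import re

# Hypothetical, hand-written reference set of known entities.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fusion", "trim": "Sedan"},
    {"make": "Honda", "model": "Civic", "trim": "Coupe"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set):
    """Return the entry sharing the most tokens with the post, or None."""
    post_tokens = tokens(post)

    def score(entry):
        entry_tokens = tokens(" ".join(entry.values()))
        return len(entry_tokens & post_tokens) / len(entry_tokens)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


match = best_match(
    "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!",
    REFERENCE_SET,
)
```

The matched entry supplies clean make, model and trim values for the post, which is why the quality and coverage of the reference set matters so much.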
13. +
Data Processing Infrastructures
Infrastructures for continuous processing of unbounded streams
of unstructured data
Information Extraction as part of processing (non-trivial
computation per each processed entry)
Inherently distributed infrastructures - in order to support
performance and scalability
Time-to-site constraints. Ability to process out-of-band data.
Support for complex operations on aggregated data (de-duplication,
static ranking, data enrichment, data cleaning/filtering …)
Support for data archival and off-line analysis
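As an illustration of one aggregated operation from the list above, here is a single-process de-duplication sketch over a record stream. The key fields and normalization are assumptions; on a truly unbounded stream the `seen` set would need to be bounded or kept in a distributed store.

```python
import hashlib

KEY_FIELDS = ("make", "model", "year", "price")  # hypothetical identity fields


def normalize(record):
    """Canonical string for a record: trimmed, lowercased key fields."""
    return "|".join(str(record.get(f, "")).strip().lower() for f in KEY_FIELDS)


def deduplicate(stream):
    """Yield each record whose normalized key has not been seen before."""
    seen = set()
    for record in stream:
        key = hashlib.sha1(normalize(record).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield record


records = [
    {"make": "Ford", "model": "Focus", "year": 2008, "price": 7000},
    {"make": "FORD ", "model": "focus", "year": 2008, "price": 7000},
    {"make": "Honda", "model": "Civic", "year": 2008, "price": 8200},
]
unique = list(deduplicate(records))
```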
15. +
Data Processing Infrastructures
Distributed Computing Platforms:
Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)
Stream-oriented (Flume, S4, Stream SQL…)
Distributed Data Stores (Dynamo/Cassandra/Riak…)
The curse of the CAP Theorem:
It is impossible for a distributed system to simultaneously provide
all three of the following guarantees:
Consistency
Availability
Partition tolerance
16. +
Vertical Search
Large-Scale structured data search
Providing both analytic and canonical Information Retrieval
functionality
Entries are represented in Vector Space Model
Each result is represented as a data point – a tuple consisting of
the appropriate number of fields :
(make, model, year, trim …)
17. +
Vertical Search
Search in Vector Space Model
Resulting subset generation
Sorting as linearization using selected metric
Dynamic subset criteria calculation
Search Result Clustering
“Similar” result search
…
… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
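The operations on this slide can be sketched on an in-memory list of records: subset generation as a filter, sorting on a chosen field, and "similar" result search as nearest neighbours over numeric fields. The records, field names and range-scaled Euclidean distance are assumptions; nothing here addresses the latency or index-size targets.

```python
from math import sqrt

RECORDS = [
    {"make": "Ford", "model": "Focus", "year": 2008, "price": 7000},
    {"make": "Ford", "model": "Focus", "year": 2010, "price": 9500},
    {"make": "Honda", "model": "Civic", "year": 2008, "price": 8200},
    {"make": "Ford", "model": "Fusion", "year": 2009, "price": 9900},
]


def search(records, filters, sort_key):
    """Generate the matching subset, then linearize it by one field."""
    subset = [r for r in records
              if all(r[f] == v for f, v in filters.items())]
    return sorted(subset, key=lambda r: r[sort_key])


def similar(records, target, numeric_fields=("year", "price"), k=2):
    """Return the k records closest to target in the numeric fields."""
    # Scale each field by its range so price does not dominate year.
    spans = {
        f: (max(r[f] for r in records) - min(r[f] for r in records)) or 1
        for f in numeric_fields
    }

    def distance(r):
        return sqrt(sum(((r[f] - target[f]) / spans[f]) ** 2
                        for f in numeric_fields))

    others = [r for r in records if r is not target]
    return sorted(others, key=distance)[:k]


fords_by_price = search(RECORDS, {"make": "Ford"}, "price")
nearest = similar(RECORDS, RECORDS[0], k=1)
```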
18. +
Vertical Search
Faceted Search
fac-et (fas’it) :
1. One of the flat polished surfaces cut on a gemstone or occurring
naturally on a crystal.
2. One of numerous aspects, as of a subject.
Vocabulary problem for faceted data
Facet Design / selection
“the keywords that are assigned by indexers are often at
odds with those tried by searchers.”
Selection of information-distinguishing facet values
User-specific faceted search
Dynamic correlated facet generation
Distributing facet computation
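At its simplest, facet computation is counting distinct field values over the current result set. A minimal sketch with hypothetical records and facet fields; dynamic, user-specific or distributed facet generation is not covered.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    return {
        field: Counter(r[field] for r in results if field in r)
        for field in facet_fields
    }


results = [
    {"make": "Ford", "year": 2008},
    {"make": "Ford", "year": 2009},
    {"make": "Honda", "year": 2008},
]
counts = facet_counts(results, ["make", "year"])
```

Because per-value counts add up across partitions, counts computed on separate index shards can be merged by summing the counters.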
19. +
Data Analytics
Clickstream Data Analysis
Learning from implicit user feedback
Anonymous user clustering
Learning to rank
Inventory/Market Trends
Rare Event detection
Price Prediction
Spam Content detection
20. +
Data Analytics
Challenges:
“Good Deal” detection
Recommendation Systems for Vertical Data with no explicit user
feedback
Accuracy of Automatic Valuation Models
Data-driven feature design
Click Prediction
User Behavior Modeling
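A toy version of “Good Deal” detection: fit a one-variable valuation model (price as a linear function of year, by ordinary least squares) and flag listings priced well under the prediction. The single feature, the 15% discount threshold and the sample listings are assumptions; a real automatic valuation model would use many more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def good_deals(listings, discount=0.15):
    """Listings priced more than `discount` below the fitted valuation."""
    slope, intercept = fit_line(
        [item["year"] for item in listings],
        [item["price"] for item in listings],
    )
    return [
        item for item in listings
        if item["price"] < (1 - discount) * (slope * item["year"] + intercept)
    ]


listings = [
    {"year": 2006, "price": 5000},
    {"year": 2007, "price": 6000},
    {"year": 2008, "price": 7000},
    {"year": 2009, "price": 8000},
    {"year": 2010, "price": 9000},
    {"year": 2009, "price": 5500},  # priced well under comparable listings
]
deals = good_deals(listings)
```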
21. +
Computational Advertising
The central problem of computational advertising is to find
the "best match" between a given user in a given context and a
suitable advertisement.
[Figure: results page with regions labeled "ads", "ads" and "search results"]
22. +
Computational Advertising
Vertical Search presents an additional challenge in the sense
that any of the actual search results can be “sponsored”
[Figure: search results with individual entries labeled "ad ?"]
23. +
Computational Advertising
Central challenge:
Find the “best match” between a given user in a given context
and a suitable advertisement
“best match” – maximizing the value for :
Users
Advertisers
Publishers
Each of the parties has a different set of utilities:
Users want relevance
Advertisers want ROI and volume
Publishers want revenue per impression/search
25. +
Computational Advertising
Analytical Apparatus:
Regression Analysis (Linear, Logistic, probit model, High
Dimensional methods)
Game Theory (Nash Equilibria, dominant strategy)
Auction Theory (Vickrey, GSP, VCG…)
Graph Theory (random walks on graphs, graph matching, etc.)
Information Retrieval Techniques (similarity metrics, etc.)
…
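To make the auction-theory item concrete, here is a sketch of a generalized second-price (GSP) allocation: advertisers are ranked by bid and each winner pays the next-highest bid. The advertiser names, the bids and the optional reserve price are hypothetical, and the quality-score weighting used in practice is left out.

```python
def gsp_auction(bids, slots, reserve=0.0):
    """Assign ad slots by bid; each winner pays the next-highest bid."""
    ranked = sorted(
        ((adv, bid) for adv, bid in bids.items() if bid >= reserve),
        key=lambda pair: pair[1],
        reverse=True,
    )
    outcome = []
    for i, (advertiser, _bid) in enumerate(ranked[:slots]):
        # The last ranked bidder has nobody below, so pays the reserve.
        price = ranked[i + 1][1] if i + 1 < len(ranked) else reserve
        outcome.append((advertiser, price))
    return outcome


bids = {"a": 4.0, "b": 2.5, "c": 1.0}
two_slots = gsp_auction(bids, slots=2)
one_slot = gsp_auction(bids, slots=1)
```

With a single slot the rule reduces to a Vickrey (second-price) auction: the highest bidder wins and pays the second-highest bid.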
26. +
Conclusion
Vertical Search & Analytics at Web Scale == fun !!!
Source of a large number of relevant research & engineering
problems !
Opportunity to tackle a wide spectrum of techniques across all
areas of Computer Science and Engineering !
Jump on the bandwagon ! : )