+




    Engineering Challenges
    in Vertical Search Engines
    Aleksandar Bradic, Senior Director,
    Engineering and R&D, Vast.com
+
    Introduction

        Vertical Search
             Search focused on vertical data
             Vertical Data – data inherently described by its structure:
                  Items/Properties for sale (Automotive, Real Estate..)
                  Geographical Data (Neighborhoods, Locations..)
                  Services (Hotels, Transportation..)
                  Businesses (Restaurants, Nightlife..)
                  Events (Concerts, Plays..)
                  Auction items (Collectibles, Art..)
                  Metadata (News, Social Data, Reviews..)
                  …
+
    Introduction

        Vertical Search != Full Text Search
             Full Text Search queries:
                “Cheap tickets for Broadway shows this week”
                “Trendy Restaurants in San Francisco near SoMa”
                “3-day trips from NYC to anywhere under $1000”
             Vertical Search queries:
                “price-sorted results below two standard deviations from tickets
                 category with Broadway as location and date range of 2010-04-11 to
                 2010-04-18”
                “distance-sorted results relative to center of SF/SoMa matching the
                 appropriate threshold of composite score of user review scores and
                 historical change in query/review volume”
                “total cost-sorted results for all 3-day intervals within next 6 months
                 combining hotel and airfare price below max value of $1000 for all
                 valid locations”
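The first structured query above can be sketched in a few lines. This is only an illustration: the `Listing` type, its field names, and the reading of "below two standard deviations" as "priced under mean minus two standard deviations of the matching set" are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Listing:
    # Hypothetical record layout for one vertical (tickets).
    category: str
    location: str
    day: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Return matching listings priced below mean - k * stddev, cheapest first."""
    matched = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.day <= end
    ]
    if len(matched) < 2:
        return []  # not enough data to talk about a deviation
    prices = [item.price for item in matched]
    threshold = mean(prices) - k * pstdev(prices)
    cheap = [item for item in matched if item.price < threshold]
    return sorted(cheap, key=lambda item: item.price)
```

Unlike a full text query, every part of the request (category, location, date range, statistical threshold, sort key) maps onto a field or an aggregate of the structured data.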
+
    Introduction

        Vertical Search = search on structured data

        Vertical Search at Web-Scale:
             Web-Scale datasets
             Web-Scale query volumes
             Interactive operation
             Low latency requirements
             Utility maximization across all involved parties

        => loads of fun ! : )
+
    @Vast.com

        Vast.com : Vertical Search & Analytics Platform

        Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
         Airlines, etc..
+
    @Vast.com

        Processing up to 1 TB of unstructured and semi-structured
         Web data daily

        Managing an operational dataset of ~150M records across
         multiple verticals

        Handling peak search loads of > 1000 queries/sec



        We’re hiring ! : )
+
    Challenges in Vertical Search
    Engines
        Web Data Retrieval

        Unstructured Data

        Data Processing Infrastructures

        Vertical Search

        Data Analytics

        Computational Advertising
+
    Web Data Retrieval

        Crawler Architecture
             Queue Management
             Crawl Ordering Policies
             Duplicate URL Detection
             Content Hash Management
             Politeness Management
             Coverage Measurement
             Freshness Optimization
             Incremental Crawling
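Three of the items above (queue management, duplicate URL detection, content hashes, politeness) can be combined into a toy crawl frontier. This is a minimal single-process sketch; the `Frontier` class, its method names, and the fixed per-host delay are assumptions for illustration, not a description of any particular crawler.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag, urlsplit


class Frontier:
    """FIFO crawl queue with URL de-duplication and a per-host politeness delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queue = deque()
        self.seen_urls = set()
        self.seen_hashes = set()
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def add(self, url):
        """Queue a URL unless it was seen before (fragments are ignored)."""
        url, _fragment = urldefrag(url)
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        self.queue.append(url)
        return True

    def next_url(self, now):
        """Pop the first queued URL whose host may be fetched at time `now`."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlsplit(url).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return url
            self.queue.append(url)  # host still cooling down, retry later
        return None

    def is_new_content(self, body: bytes):
        """Content-hash check: True only the first time a page body is seen."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

At Web scale the in-memory sets would typically be replaced by something more compact or distributed; the sketch only shows how the concerns fit together.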
+
    Web Data Retrieval

        “Deep Web” crawling
             Locating Deep Web Content Sources
             Selecting Relevant Sources
             Estimating Database Size
             Understanding Content / Form Detection
             Automatic Dispatch of HTML Forms
             Predicting content in free text forms
             Crawling non-HTML Content
             Estimating Query Result Sparsity
             URL Generation problem
             Query Covering Problem
+
    Web Data Retrieval

        Focused (Topical) Crawling
             Content Classification
             Link Content Prediction
             Topic Relevance Estimation

        Modeling Temporal Characteristics
             Site-Level Evolution
             Page-Level Evolution

        Adversarial Crawling
             Web Spam Detection
             Cloaked Content Detection
+
    Unstructured Data

        Unstructured Data – information that does not have a pre-
         defined data model

        Handling Unstructured Data:
             Data Cleaning
             Tagging with Metadata
             Vertical Classification
             Schema Matching
             Information Extraction


    Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

    make: Ford | model: Focus | year: 2008 | trim: Convertible | price: $7000 | ???: Absolute Beauty !!!!
+
    Unstructured Data

        Information extraction from unstructured, ungrammatical
         data
             Reference Sets - relational data sets that consist of a collection
              of known entities with associated common attributes
             Reference Set Selection
             Reference Set Generation
             Record Linkage : Finding the “best matching” member of the
              reference set for a given post
             Challenge : Automatic Generation of Reference Sets
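Record linkage against a reference set can be sketched as follows. The reference set contents, the token-overlap score, and the regular expressions for year and price are all assumptions made for the example; real systems would use more robust similarity measures.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set):
    """Return the reference record sharing the most attribute tokens with the post."""
    post_tokens = tokens(post)

    def score(record):
        record_tokens = set()
        for value in record.values():
            record_tokens |= tokens(value)
        return len(record_tokens & post_tokens) / len(record_tokens)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post, reference_set):
    """Combine the linked reference record with year and price pulled by regex."""
    record = dict(best_match(post, reference_set) or {})
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group(0))
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

Applied to the post "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!", `extract` fills make, model, trim, year and price; the trailing "Absolute Beauty" text stays unassigned, which is the "???" case.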
+
    Data Processing Infrastructures

        Infrastructures for continuous processing of unbounded streams
         of unstructured data
        Information Extraction as part of processing (non-trivial
         computation for each processed entry)

        Inherently distributed infrastructures - in order to support
         performance and scalability

        Time-to-site constraints. Ability to process out-of-band data.

        Support for complex operations on aggregated data (de-
         duplication, static ranking, data enrichment, data cleaning/
         filtering …)

        Support for data archival and off-line analysis
+
    Data Processing Infrastructures

        Distributed Computing Platforms:

             Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

             Stream-oriented (Flume, S4, Stream SQL…)

             Distributed Data Stores (Dynamo/Cassandra/Riak…)

        The curse of CAP Theorem:
             It is impossible for a distributed system to simultaneously provide
              all three of the following guarantees:
                Consistency
                Availability
                Partition tolerance
+
    Vertical Search

        Large-Scale structured data search

        Providing both analytic and canonical Information Retrieval
         functionality

        Entries are represented in Vector Space Model

        Each result is represented as a data point: a tuple consisting of
         the appropriate number of fields:

         (make, model, year, trim …)
+
    Vertical Search

        Search in Vector Space Model
             Resulting subset generation
             Sorting as linearization using selected metric
             Dynamic subset criteria calculation
             Search Result Clustering
             “Similar” result search
             …



… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
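A toy version of the operations listed above: subset generation by field predicates, sorting as linearization under a chosen metric, and "similar" result search as a nearest-neighbour lookup. The record layout, the scaling constants, and the function names are assumptions; a production engine would rely on indexes rather than linear scans to meet the latency targets.

```python
import math

# Hypothetical records: (make, model, year, price) tuples.
RECORDS = [
    ("Ford", "Focus", 2008, 7000),
    ("Ford", "Focus", 2010, 9500),
    ("Honda", "Civic", 2008, 8200),
    ("Ford", "Fiesta", 2009, 6500),
]

FIELDS = {"make": 0, "model": 1, "year": 2, "price": 3}


def subset(records, **criteria):
    """Resulting-subset generation: keep records matching every field predicate."""
    return [
        r for r in records
        if all(pred(r[FIELDS[name]]) for name, pred in criteria.items())
    ]


def distance(a, b, year_scale=1.0, price_scale=1000.0):
    """Scaled Euclidean distance over the numeric fields (year, price)."""
    return math.hypot((a[2] - b[2]) / year_scale, (a[3] - b[3]) / price_scale)


def similar(records, target, k=2):
    """'Similar' result search: the k records nearest to target, excluding target."""
    others = [r for r in records if r != target]
    return sorted(others, key=lambda r: distance(r, target))[:k]
```

For example, `sorted(subset(RECORDS, make=lambda m: m == "Ford"), key=lambda r: r[3])` linearizes the Ford subset by price.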
+
    Vertical Search

        Faceted Search
             fac-et (fas’it) :
                1. One of the flat polished surfaces cut on a gemstone or occurring
                 naturally on a crystal.
                2. One of numerous aspects, as of a subject.


             Vocabulary problem for faceted data
                “the keywords that are assigned by indexers are often at
                  odds with those tried by searchers.”
             Facet Design / selection
                Selection of information-distinguishing facet values
             User-specific faceted search
             Dynamic correlated facet generation
             Distributing facet computation
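Facet computation over a result set, and its distribution across shards, can be sketched with per-shard counts that are merged afterwards. The result records and function names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical result set; field names are illustrative.
RESULTS = [
    {"make": "Ford", "model": "Focus", "year": 2008},
    {"make": "Ford", "model": "Fiesta", "year": 2009},
    {"make": "Honda", "model": "Civic", "year": 2008},
]


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    return {
        field: Counter(r[field] for r in results if field in r)
        for field in facet_fields
    }


def merge_facet_counts(partials):
    """Combine per-shard facet counts, as a distributed deployment might."""
    merged = {}
    for partial in partials:
        for field, counts in partial.items():
            merged.setdefault(field, Counter()).update(counts)
    return merged
```

Because counts are additive, each host can compute facets for its own slice of the index and a coordinator only needs to sum them.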
+
    Data Analytics

        Clickstream Data Analysis

        Learning from implicit user feedback

        Anonymous user clustering

        Learning to rank

        Inventory/Market Trends

        Rare Event detection

        Price Prediction

        Spam Content detection
+
    Data Analytics

        Challenges:
             “Good Deal” detection
             Recommendation Systems for Vertical Data with no explicit user
              feedback
             Accuracy of Automatic Valuation Models
             Data-driven feature design
             Click Prediction
             User Behavior Modeling
+
    Computational Advertising

        The central problem of computational advertising is to find
         the "best match" between a given user in a given context and a
         suitable advertisement.




    [Figure: page layout with ad blocks placed around the search results]
+
    Computational Advertising

        Vertical Search presents an additional challenge in the sense
         that any of the actual search results can be “sponsored”




    [Figure: search results list in which individual results are marked “ad ?”]
+
    Computational Advertising

        Central challenge:
             Find the “best match” between a given user in a given context
              and a suitable advertisement
             “best match” – maximizing the value for :
                  Users
                  Advertisers
                  Publishers
             Each of the parties has a different set of utilities:
                  Users want relevance
                  Advertisers want ROI and volume
                  Publishers want revenue per impression/search
+
    Computational Advertising

        CTR (Click-Through Rate) Estimation:
             Reactive (statistically significant historical CTR)
             Predictive (CTR estimated from features of ads)
             Hybrid (historical + predictive)


             Personalization of CTR Computation ?
             Dynamic CTR Estimation (online algorithms)




                                  P(click) = ?
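One possible reading of the reactive / predictive / hybrid split, as a minimal sketch: a logistic model supplies a feature-based prior, and additive smoothing blends it with observed clicks and impressions. The function names, the `prior_strength` parameter, and the choice of smoothing are assumptions, not the method used in the talk.

```python
import math


def predictive_ctr(features, weights, bias):
    """Predictive estimate: logistic model over ad features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def hybrid_ctr(clicks, impressions, prior_ctr, prior_strength=100.0):
    """Blend historical counts with a prior CTR via additive smoothing.

    With no history the result equals prior_ctr; as impressions grow it
    approaches the purely reactive estimate clicks / impressions.
    """
    return (clicks + prior_strength * prior_ctr) / (impressions + prior_strength)
```

The `prior_strength` value acts as a count of pseudo-impressions: the larger it is, the more history is needed before the historical CTR dominates.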
+
    Computational Advertising

        Analytical Apparatus:
             Regression Analysis (Linear, Logistic, probit model, High
              Dimensional methods)
             Game Theory (Nash Equilibria, dominant strategy)
             Auction Theory (Vickrey, GSP, VCG…)
             Graph Theory (random walks on graphs, graph matching, etc.)
             Information Retrieval Techniques (similarity metrics, etc.)
             …
+
    Conclusion

        Vertical Search & Analytics at Web Scale == fun !!!

        A source of a large number of relevant research & engineering
         problems !

        Opportunity to apply a wide spectrum of techniques from all
         areas of Computer Science and Engineering !




                                       Jump on the bandwagon ! : )

 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

Engineering Challenges in Vertical Search Engines

  • 1. + Engineering Challenges in Vertical Search Engines Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
  • 2. + Introduction   Vertical Search   Search focused on vertical data   Vertical Data – data inherently described by its structure:   Items/Properties for sale (Automotive, Real Estate..)   Geographical Data (Neighborhoods, Locations..)   Services (Hotels, Transportation..)   Businesses (Restaurants, Nightlife..)   Events (Concerts, Plays..)   Auction items (Collectibles, Art..)   Metadata (News, Social Data, Reviews..)   …
  • 3. + Introduction   Vertical Search != Full Text Search   Full Text Search queries:   “Cheap tickets for Broadway shows this week”   “Trendy Restaurants in San Francisco near SoMa”   “3-day trips from NYC to anywhere under $1000”   Vertical Search queries:   “price-sorted results below two standard deviations from tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”   “distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”   “total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price below max value of $1000 for all valid locations”
  • 4. + Introduction   Vertical Search = search on structured data   Vertical Search at Web-Scale:   Web-Scale datasets   Web-Scale query volumes   Interactive operation   Low latency requirements   Utility maximization across all involved parties   => loads of fun ! : )
  • 5. + @Vast.com   Vast.com : Vertical Search & Analytics Platform   Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..
  • 6. + @Vast.com   Daily processing of up to 1Tb of unstructured and semi-structured Web data   Managing an operational dataset of ~150M records across multiple verticals   Handling > 1000 queries/sec peak search query loads   We’re hiring ! : )
  • 7. + Challenges in Vertical Search Engines   Web Data Retrieval   Unstructured Data   Data Processing Infrastructures   Vertical Search   Data Analytics   Computational Advertising
  • 8. + Web Data Retrieval   Crawler Architecture   Queue Management   Crawl Ordering Policies   Duplicate URL Detection   Content Hash Management   Politeness Management   Coverage Measurement   Freshness Optimization   Incremental Crawling
  • 9. + Web Data Retrieval   “Deep Web” crawling   Locating Deep Web Content Sources   Selecting Relevant Sources   Estimating Database Size   Understanding Content / Form Detection   Automatic Dispatch of HTML Forms   Predicting content in free text forms   Crawling non-HTML Content   Estimating Query Result Sparsity   URL Generation problem   Query Covering Problem
  • 10. + Web Data Retrieval   Focused (Topical) Crawling   Content Classification   Link Content Prediction   Topic Relevance Estimation   Modeling Temporal Characteristics   Site-Level Evolution   Page-Level Evolution   Adversarial Crawling   Web Spam Detection   Cloaked Content Detection
  • 11. + Unstructured Data   Unstructured Data – information that does not have a pre-defined data model   Handling Unstructured Data:   Data Cleaning   Tagging with Metadata   Vertical Classification   Schema Matching   Information Extraction   Example: “Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!” => make, model, year, trim, price ???
  • 12. + Unstructured Data   Information extraction from unstructured, ungrammatical data   Reference Sets - relational data sets that consist of a collection of known entities with associated common attributes   Reference Set Selection   Reference Set Generation   Record Linkage : Finding the “best matching” member of the reference set for the corresponding post   Challenge : Automatic Generation of Reference Sets
  • 13. + Data Processing Infrastructures   Infrastructures for continuous processing of unbounded streams of unstructured data   Information Extraction as part of processing (non-trivial computation per processed entry)   Inherently distributed infrastructures - in order to support performance and scalability   Time-to-site constraints. Ability to process out-of-band data.   Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)   Support for data archival and off-line analysis
  • 14. + Data Processing Infrastructures
  • 15. + Data Processing Infrastructures   Distributed Computing Platforms:   Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)   Stream-oriented (Flume, S4, Stream SQL…)   Distributed Data Stores (Dynamo/Cassandra/Riak…)   The curse of CAP Theorem:   It is impossible for a distributed system to simultaneously provide all three of the following guarantees:   Consistency   Availability   Partition tolerance
  • 16. + Vertical Search   Large-Scale structured data search   Providing both analytic and canonical set of Information Retrieval functionalities   Entries are represented in Vector Space Model   Each result is represented as data point – tuple consisting of appropriate number of fields : (make, model, year, trim …)
  • 17. + Vertical Search   Search in Vector Space Model   Resulting subset generation   Sorting as linearization using selected metric   Dynamic subset criteria calculation   Search Result Clustering   “Similar” result search   …   … with up to ~100 ms response time   … at 10M+ records in index   … handling 100+ queries/sec/host
  • 18. + Vertical Search   Faceted Search   fac-et (fas’it) :   1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.   2. One of numerous aspects, as of a subject.   Vocabulary problem for faceted data   Facet Design / selection   “the keywords that are assigned by indexers are often at odds with those tried by searchers.”   Selection of information-distinguishing facet values   User-specific faceted search   Dynamic correlated facet generation   Distributing facet computation
  • 19. + Data Analytics   Clickstream Data Analysis   Learning from implicit user feedback   Anonymous user clustering   Learning to rank   Inventory/Market Trends   Rare Event detection   Price Prediction   Spam Content detection
  • 20. + Data Analytics   Challenges:   “Good Deal” detection   Recommendation Systems for Vertical Data with no explicit user feedback   Accuracy of Automatic Valuation Models   Data-driven feature design   Click Prediction   User Behavior Modeling
  • 21. + Computational Advertising   The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.   (Figure labels: ads, ads, search results)
  • 22. + Computational Advertising   Vertical Search presents an additional challenge in the sense that any of the actual search results can be a “sponsored” ad   (Figure labels: ad ?, ad ?)
  • 23. + Computational Advertising   Central challenge:   Find the “best match” between a given user in a given context and a suitable advertisement   “best match” – maximizing the value for :   Users   Advertisers   Publishers   Each of the parties has different set of utilities:   Users want relevance   Advertisers want ROI and volume   Publishers want revenue per impression/search
  • 24. + Computational Advertising   CTR (Click-Through Rate) Estimation:   Reactive (statistically significant historical CTR)   Predictive (CTR estimated from features of ads)   Hybrid (historical + predictive)   Personalization of CTR Computation ?   Dynamic CTR Estimation (online algorithms)   P(click) = ?
  • 25. + Computational Advertising   Analytical Apparatus:   Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)   Game Theory (Nash Equilibria, dominant strategy)   Auction Theory (Vickrey, GSP, VCG…)   Graph Theory (random walks on graphs, graph matching, etc.)   Information Retrieval Techniques (similarity metrics, etc.)   …
  • 26. + Conclusion   Vertical Search & Analytics at Web Scale == fun !!!   Source of a large number of relevant research & engineering problems !   Opportunity to tackle a wide spectrum of techniques across all areas of Computer Science and Engineering !   Jump on the bandwagon ! : )
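
The Vertical Search slides (16 to 18) describe results as tuples of fields, search as "resulting subset generation" followed by "sorting as linearization using selected metric", and faceted search over the result set. A minimal Python sketch of those three steps follows. It is an illustration only, not Vast.com's implementation: the `Listing` type, the toy `LISTINGS` data, and the `search` and `facet_counts` helpers are all hypothetical names, and a real engine would use an index instead of scanning a list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One structured record: a tuple of fields, as on slide 16 (hypothetical schema)."""
    make: str
    model: str
    year: int
    price: int


# Hypothetical toy dataset standing in for the index.
LISTINGS = [
    Listing("Ford", "Focus", 2008, 7000),
    Listing("Ford", "Fiesta", 2009, 6500),
    Listing("Honda", "Civic", 2008, 8200),
    Listing("Honda", "Accord", 2007, 9100),
    Listing("Ford", "Focus", 2007, 5900),
]


def search(listings, filters, sort_key, max_price=None):
    """Subset generation (exact-match filters plus an optional price cap),
    then linearization of the subset by sorting on a single field."""
    subset = [
        item for item in listings
        if all(getattr(item, field) == value for field, value in filters.items())
        and (max_price is None or item.price <= max_price)
    ]
    return sorted(subset, key=lambda item: getattr(item, sort_key))


def facet_counts(results, field):
    """Count how many results fall under each value of one facet field."""
    return Counter(getattr(item, field) for item in results)


results = search(LISTINGS, {"make": "Ford"}, sort_key="price", max_price=7000)
for item in results:
    print(item)
print(facet_counts(results, "model"))
```

Here the facet counts are computed over the filtered subset, not the whole dataset, which is what lets a UI show how many results each further refinement would leave. Doing this within the latency and throughput figures on slide 17 is where the engineering challenge lies; a linear scan like this one would not scale to 10M+ records.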